Data mining: text mining, sentiment analysis, and visualization of the Harry Potter novels in R (with code and data)

Recently a client asked us to write a research report on text mining, including some graphical and statistical output.

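Once we have cleaned our text and done some basic word-frequency analysis, the next step is to understand the opinion or emotion in the text. This is known as sentiment analysis, and this tutorial walks you through a simple way to do it.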

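This tutorial is an introduction to sentiment analysis. It builds on the tidy text tutorial, so if you have not read that one, I suggest starting there. In this tutorial I cover the following.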

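Requirements: what you need to reproduce the analysis in this tutorial
Sentiment datasets: the main datasets used to score sentiment
Basic sentiment analysis: performing a basic sentiment analysis
Comparing sentiments: comparing how the sentiment lexicons differ
Common sentiment words: finding the most common positive and negative words
Sentiment analysis of larger units: analyzing sentiment over larger units of text rather than single words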

This tutorial uses the harrypotter text data to illustrate text mining and analysis capabilities.

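library(tidyverse)   # data manipulation and plotting
library(stringr)     # text cleaning and regular expressions
library(tidytext)    # additional text mining functions
library(harrypotter) # text of the seven novels (philosophers_stone, ...)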

The seven novels we are working with are:

philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
deathly_hallows: Harry Potter and the Deathly Hallows (2007)

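Each text is a character vector in which each element represents one chapter. For example, the following shows the raw text of the first two chapters of philosophers_stone.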

philosophers_stone[1:2]
## [1] "THE BOY WHO LIVED  Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank
## you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold
## with such nonsense.  Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly
## any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck,
## which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a
## small son called Dudley and in their opinion there was no finer boy anywhere.  The Dursleys had everything they wanted, but they also
## had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out
## about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn'... <truncated>
## [2] "THE VANISHING GLASS  Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but
## Privet Drive had hardly changed at all. The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys'
## front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen
## that fateful news report about the owls. Only the photographs on the mantelpiece really showed how much time had passed. Ten years ago,
## there had been lots of pictures of what looked like a large pink beach ball wearing different-colored bonnets -- but Dudley Dursley was
## no longer a baby, and now the photographs showed a large blond boy riding his first bicycle, on a carousel at the fair, playing a
## computer game with his father, being hugged and kissed by his mother. The room held no sign at all that another boy lived in the house,
## too.  Yet Harry Potter was still there, asleep at the moment, but no... <truncated>

A variety of dictionaries exist for evaluating the opinion or emotion in text. The tidytext package contains three sentiment lexicons in its sentiments dataset.

sentiments
## # A tibble: 23,165 × 4
##           word sentiment lexicon score
##          <chr>     <chr>   <chr> <int>
## 1       abacus     trust     nrc    NA
## 2      abandon      fear     nrc    NA
## 3      abandon  negative     nrc    NA
## 4      abandon   sadness     nrc    NA
## 5    abandoned     anger     nrc    NA
## 6    abandoned      fear     nrc    NA
## 7    abandoned  negative     nrc    NA
## 8    abandoned   sadness     nrc    NA
## 9  abandonment     anger     nrc    NA
## 10 abandonment      fear     nrc    NA
## # ... with 23,155 more rows

The three lexicons are

AFINN
bing
nrc

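All three lexicons are based on unigrams (single words). They contain many English words, each assigned a score for positive or negative sentiment and, in some cases, for emotions such as joy, anger, or sadness. The nrc lexicon classifies words in a binary ("yes"/"no") fashion into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon classifies words in a binary fashion into positive and negative categories. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.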

# view the individual lexicons
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")
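To make the difference between the three layouts concrete, here is a minimal sketch that scores a handful of words against tiny hand-made stand-ins for each lexicon. The names `toy_afinn`, `toy_bing`, and `toy_nrc`, and the words and scores inside them, are illustrative assumptions and are not copied from the real AFINN, bing, or nrc data.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature lexicons that mimic the three layouts
toy_afinn <- tibble(word = c("happy", "awful"), score = c(3L, -3L))   # numeric score
toy_bing  <- tibble(word = c("happy", "awful"),
                    sentiment = c("positive", "negative"))            # binary label
toy_nrc   <- tibble(word = c("happy", "happy", "awful"),
                    sentiment = c("joy", "positive", "negative"))     # one row per category

words <- tibble(word = c("what", "a", "happy", "day", "awful", "weather"))

# inner_join keeps only the words that appear in the lexicon
afinn_hits <- inner_join(words, toy_afinn, by = "word")
bing_hits  <- inner_join(words, toy_bing,  by = "word")
nrc_hits   <- inner_join(words, toy_nrc,   by = "word")

sum(afinn_hits$score)        # net score from the numeric lexicon
count(bing_hits, sentiment)  # tallies from the binary lexicon
count(nrc_hits, sentiment)   # one word can contribute to several categories
```

Note that in the nrc-style layout a single word ("happy" here) produces more than one row after the join, which is why category counts from nrc can exceed the number of matched words.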

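To perform sentiment analysis we need our data in a tidy format. The following converts all seven Harry Potter novels into a tibble in which every word is arranged by chapter within book. See the tidy text tutorial for more details.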

# set the factor levels to keep the books in order of publication
series$book <- factor(series$book, levels = rev(titles))

series
## # A tibble: 1,089,386 × 3
##                   book chapter    word
## *               <fctr>   <int>   <chr>
## 1  Philosopher's Stone       1     the
## 2  Philosopher's Stone       1     boy
## 3  Philosopher's Stone       1     who
## 4  Philosopher's Stone       1   lived
## 5  Philosopher's Stone       1      mr
## 6  Philosopher's Stone       1     and
## 7  Philosopher's Stone       1     mrs
## 8  Philosopher's Stone       1 dursley
## 9  Philosopher's Stone       1      of
## 10 Philosopher's Stone       1  number
## # ... with 1,089,376 more rows
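The excerpt above shows only the last step of the tidying. As a rough sketch of how a one-word-per-row tibble like `series` could be built, the following tokenizes two made-up "books" with `unnest_tokens()`. The objects `toy_titles`, `toy_books`, and `toy_series` are hypothetical stand-ins for the real title vector and chapter vectors, and the loop is one possible approach rather than the tutorial's exact code.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-ins: each "book" is a character vector with one element per chapter
toy_titles <- c("Book One", "Book Two")
toy_books  <- list(
  c("The boy lived.", "The glass vanished!"),
  c("A dark night.")
)

# Tokenize each book into one word per row, tagged with book and chapter
toy_series <- bind_rows(lapply(seq_along(toy_titles), function(i) {
  tibble(chapter = seq_along(toy_books[[i]]), text = toy_books[[i]]) %>%
    unnest_tokens(word, text) %>%   # lowercases and strips punctuation by default
    mutate(book = toy_titles[i]) %>%
    select(book, chapter, word)
}))

# Store the book as a factor so plots keep a fixed book order
toy_series$book <- factor(toy_series$book, levels = toy_titles)
toy_series
```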

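Now let's use the nrc sentiment dataset to assess the different sentiments represented across the Harry Potter series. We can see that negative sentiment has a stronger presence than positive sentiment.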

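        series %>%
        inner_join(get_sentiments("nrc"), by = "word") %>%  # keep only words found in the lexicon
        filter(!is.na(sentiment)) %>%
        count(sentiment, sort = TRUE)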


## # A tibble: 10 × 2
##       sentiment     n
##           <chr> <int>
## 1      negative 56579
## 2      positive 38324
## 3       sadness 35866
## 4         anger 32750
## 5         trust 23485
## 6          fear 21544
## 7  anticipation 21123
## 8           joy 14298
## 9       disgust 13381
## 10     surprise 12991

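This gives a good overall sense, but what if we want to understand how sentiment changes over the course of each novel? To do that, we perform the following steps.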

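Create an index that breaks each book up by 500 words; this is the approximate number of words on every two pages, so it lets us assess changes in sentiment even within chapters.
Join the bing lexicon with inner_join to assess the positive versus negative sentiment of each word.
Count how many positive and negative words there are for every two pages.
Spread our data.
Calculate the net sentiment (positive - negative).
Plot our data.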


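        # final step of the pipeline: plot the net sentiment of each 500-word block per book
        ggplot(aes(index, sentiment, fill = book)) +
          geom_bar(alpha = 0.5, stat = "identity")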

Now we can see how the plot of each novel moves toward more positive or more negative sentiment over the trajectory of the story.
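The steps listed above can be sketched end to end on a toy example. Everything here is an assumption made for illustration: `toy_words` stands in for the tokenized series, `toy_lexicon` for the bing lexicon, and `block_size` is set to 6 rather than 500 so the result is easy to check by hand. Plotting is left out.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data: one book of 12 words and a tiny bing-style lexicon
toy_words <- tibble(
  book = "Toy Book",
  word = c("good", "dark", "day", "great", "the", "death",
           "bad", "well", "a", "dark", "fear", "night")
)
toy_lexicon <- tibble(
  word      = c("good", "great", "well", "dark", "death", "bad", "fear"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative", "negative")
)
block_size <- 6  # the tutorial uses 500; 6 keeps the toy example readable

net_sentiment <- toy_words %>%
  group_by(book) %>%
  mutate(index = row_number() %/% block_size + 1) %>%  # 1. block index (first block is one word short)
  ungroup() %>%
  inner_join(toy_lexicon, by = "word") %>%             # 2. attach sentiment labels
  count(book, index, sentiment) %>%                    # 3. tally labels per block
  spread(sentiment, n, fill = 0) %>%                   # 4. one column per label
  mutate(sentiment = positive - negative)              # 5. net sentiment

net_sentiment
```

Blocks that contain no lexicon words (the last block here, which holds only "night") drop out at the join, so they do not appear in the result at all.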


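With several sentiment lexicons to choose from, you may want more information about which one suits your purpose. Let's use all three lexicons and examine how they differ for each novel.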

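# Tag each word with its 500-word block (`indexed` and `nrc_pn` are our helper names)
indexed <- series %>% group_by(book) %>% mutate(index = row_number() %/% 500 + 1) %>% ungroup()
nrc_pn <- get_sentiments("nrc") %>% filter(sentiment %in% c("positive", "negative"))

# AFINN: sum the word scores per block (newer tidytext releases name this column `value`)
afinn <- indexed %>%
        inner_join(get_sentiments("afinn"), by = "word") %>%
        group_by(book, index) %>%
        summarise(sentiment = sum(score)) %>%
        mutate(method = "AFINN")

# Bing and NRC: count positive and negative words per block, then take the difference
bing_and_nrc <- bind_rows(
        indexed %>% inner_join(get_sentiments("bing"), by = "word") %>% mutate(method = "Bing"),
        indexed %>% inner_join(nrc_pn, by = "word") %>% mutate(method = "NRC")) %>%
        count(book, method, index, sentiment) %>%
        spread(sentiment, n, fill = 0) %>%
        mutate(sentiment = positive - negative)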

We now have an estimate of the net sentiment (positive - negative) in the novel text for each sentiment lexicon. Let's plot them.

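# Stack the AFINN result (assumed to be stored as `afinn`) with bing_and_nrc, then plot
bind_rows(afinn, bing_and_nrc) %>%
  ungroup() %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_grid(book ~ method)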

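The three lexicons give results that differ in an absolute sense but follow fairly similar relative trajectories through the novels. We see similar dips and peaks in sentiment at about the same places in each novel, but the absolute values differ noticeably. In some cases the AFINN lexicon appears to find more positive sentiment than the NRC lexicon. This output also lets us compare across novels. First, you get a good sense of the differences in book length: Order of the Phoenix is much longer than Philosopher's Stone. Second, you can compare how the books in the series differ in sentiment.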
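A small sketch may help show why absolute values can differ while the direction stays the same. It scores the same two blocks of words with a numeric, AFINN-style lexicon and with a binary, bing-style lexicon. All objects (`toy_indexed`, `toy_afinn`, `toy_bing`) and their scores are made up for illustration, and the binary net score is computed directly with `sum()` instead of the count-and-spread steps, purely to keep the example short.

```r
library(dplyr)
library(tibble)

# Hypothetical block-indexed words and two miniature lexicons
toy_indexed <- tibble(
  index = c(1, 1, 1, 2, 2),
  word  = c("good", "great", "dark", "death", "well")
)
toy_afinn <- tibble(word  = c("good", "great", "dark", "death", "well"),
                    score = c(3, 3, -1, -2, 1))   # illustrative scores only
toy_bing  <- tibble(word      = c("good", "great", "dark", "death", "well"),
                    sentiment = c("positive", "positive", "negative",
                                  "negative", "positive"))

# Numeric lexicon: sum the scores in each block
afinn_net <- toy_indexed %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(index) %>%
  summarise(sentiment = sum(score)) %>%
  mutate(method = "AFINN-style")

# Binary lexicon: positive count minus negative count in each block
bing_net <- toy_indexed %>%
  inner_join(toy_bing, by = "word") %>%
  group_by(index) %>%
  summarise(sentiment = sum(sentiment == "positive") - sum(sentiment == "negative")) %>%
  mutate(method = "bing-style")

comparison <- bind_rows(afinn_net, bing_net)
comparison
```

Both methods rank block 1 above block 2, but the numeric lexicon spreads the values further apart because each word contributes its score rather than a single count.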

One advantage of having a data frame with both sentiment and word is that we can analyze the word counts that contribute to each sentiment.

word_counts
## # A tibble: 3,313 × 3
##      word sentiment     n
##     <chr>     <chr> <int>
## 1    like  positive  2416
## 2    well  positive  1969
## 3   right  positive  1643
## 4    good  positive  1065
## 5    dark  negative  1034
## 6   great  positive   877
## 7   death  negative   757
## 8   magic  positive   606
## 9  better  positive   533
## 10 enough  positive   509
## # ... with 3,303 more rows

We can view this visually to assess the top n words for each sentiment.

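        word_counts %>%
          group_by(sentiment) %>% top_n(10, n) %>%   # keep the top 10 words per sentiment
          ggplot(aes(reorder(word, n), n, fill = sentiment)) +
          geom_bar(alpha = 0.8, stat = "identity") + coord_flip()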
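As a hedged illustration of how per-word contributions like these can be tallied, the sketch below counts lexicon words in a made-up token list and keeps the most frequent word for each sentiment. `toy_words`, `toy_lexicon`, `toy_word_counts`, and `top_words` are hypothetical names, and the lexicon is a four-word stand-in rather than the real bing data.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens and a tiny bing-style lexicon
toy_words <- tibble(word = c("like", "dark", "like", "well", "death",
                             "dark", "like", "the", "well", "magic"))
toy_lexicon <- tibble(
  word      = c("like", "well", "dark", "death"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Count how often each lexicon word occurs, most frequent first
toy_word_counts <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(word, sentiment, sort = TRUE)

# Keep the single most frequent word for each sentiment
top_words <- toy_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

top_words
```

Raising `n = 1` to `n = 10` gives the "top n" view used for the plot. Words missing from the lexicon ("the" and "magic" in this toy lexicon) never reach the counts.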

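A lot of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond unigrams (single words) and try to understand the sentiment of a sentence as a whole. These algorithms try to understand that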

I am not having a good day.

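is a sad sentence, not a happy one, because of the negation. The Stanford CoreNLP tools are an example of such sentiment analysis algorithms. For these, we may want to tokenize text into sentences. I use the philosophers_stone dataset to illustrate.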

tibble(text = philosophers_stone) %>%
  unnest_tokens(sentence, text, token = "sentences")
##                                                                       sentence
##                                                                          <chr>
## 1                                              the boy who lived  mr. and mrs.
## 2  dursley, of number four, privet drive, were proud to say that they were per
## 3  they were the last people you'd expect to be involved in anything strange o
## 4                                                                          mr.
## 5      dursley was the director of a firm called grunnings, which made drills.
## 6  he was a big, beefy man with hardly any neck, although he did have a very l
## 7                                                                         mrs.
## 8  dursley was thin and blonde and had nearly twice the usual amount of neck, 
## 9  the dursleys had a small son called dudley and in their opinion there was n
## 10 the dursleys had everything they wanted, but they also had a secret, and th
## # ... with 6,588 more rows

The argument token = "sentences" attempts to split the text at punctuation.

Let's go on to break the philosophers_stone text down by chapter and sentence.

ps_sentences <- tibble(chapter = seq_along(philosophers_stone),  # object name assumed
                        text = philosophers_stone) %>% 
  unnest_tokens(sentence, text, token = "sentences")
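For a quick feel of what sentence tokenization produces, here is a minimal sketch on two made-up chapters (`toy_chapters` and `toy_sentences` are hypothetical names). The sample text deliberately avoids abbreviations, since the output shown earlier suggests that tokens such as "mr." get split off as their own sentences.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical two-chapter text, one element per chapter
toy_chapters <- c("He was proud. They were normal!",
                  "Ten years passed. Nothing changed. Harry slept.")

toy_sentences <- tibble(chapter = seq_along(toy_chapters), text = toy_chapters) %>%
  unnest_tokens(sentence, text, token = "sentences")

toy_sentences                  # one row per sentence, lowercased by default
count(toy_sentences, chapter)  # number of sentences found in each chapter
```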

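This lets us assess net sentiment by chapter and by sentence. First, we need to track the sentence numbers, and then I create an index that tracks progress through each chapter. Next, I unnest the sentences by word. This gives us a tibble with the individual words by sentence within each chapter. Now, as before, I join the AFINN lexicon and compute the net sentiment score at each position within each chapter. We can see that the most positive sentences are about halfway through chapter 9, toward the end of chapter 17, early in chapter 4, and so on.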

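        book_sent <- ps_sentences %>%   # ps_sentences: sentence-level tibble (name assumed)
        group_by(chapter) %>% mutate(index = round(row_number() / n(), 2)) %>%
        unnest_tokens(word, sentence) %>% inner_join(get_sentiments("afinn"), by = "word") %>%
        group_by(chapter, index) %>%
        summarise(sentiment = sum(score, na.rm = TRUE)) %>%
        arrange(desc(sentiment))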


## Source: local data frame [1,401 x 3]
## Groups: chapter [17]
## 
##    chapter index sentiment
##      <int> <dbl>     <int>
## 1        9  0.47        14
## 2       17  0.91        13
## 3        4  0.11        12
## 4       12  0.45        12
## 5       17  0.54        12
## 6        1  0.25        11
## 7       10  0.04        11
## 8       10  0.16        11
## 9       11  0.48        11
## 10      12  0.70        11
## # ... with 1,391 more rows
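The sentence-level scoring described above can be sketched on toy data as follows. The sentences, the `toy_afinn` scores, and the object names are assumptions made for illustration, and `index` is simply the sentence number divided by the number of sentences in its chapter.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical sentences for two chapters and an AFINN-style lexicon
toy_sentences <- tibble(
  chapter  = c(1, 1, 2, 2),
  sentence = c("a good and happy day", "a dark night",
               "nothing happened", "a great happy feast")
)
toy_afinn <- tibble(word  = c("good", "happy", "dark", "great"),
                    score = c(3, 3, -2, 3))   # illustrative scores only

toy_book_sent <- toy_sentences %>%
  group_by(chapter) %>%
  mutate(sentence_num = row_number(),
         index = round(sentence_num / n(), 2)) %>%  # position within the chapter, up to 1
  ungroup() %>%
  unnest_tokens(word, sentence) %>%                 # sentences back to single words
  inner_join(toy_afinn, by = "word") %>%
  group_by(chapter, index) %>%
  summarise(sentiment = sum(score), .groups = "drop") %>%
  arrange(desc(sentiment))

toy_book_sent
```

A sentence with no lexicon words ("nothing happened" here) disappears at the join, so it has no row in the result rather than a score of zero.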

We can visualize this with a heatmap that shows the most positive and most negative sentiment as we progress through each chapter.

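ggplot(book_sent, aes(index, factor(chapter), fill = sentiment)) +
        geom_tile(color = "white") +
        scale_fill_gradient2() +
        labs(x = "Chapter progression", y = "Chapter")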


This article is excerpted from "Text mining, sentiment analysis, and visualization of Harry Potter novel text data in R".

Full text download link: http://tecdat.cn/?p=22984
