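This tutorial is an introduction to sentiment analysis. It builds on the tidy text tutorial, so if you have not read that tutorial I suggest you start there. In this tutorial I cover the following:
- Replication requirements: what you need to reproduce the analysis in this tutorial
- Sentiment data sets: the primary data sets used to score sentiment
- Basic sentiment analysis: performing a basic sentiment analysis
- Comparing sentiments: comparing how sentiments differ across the sentiment lexicons
- Common sentiment words: finding the most common positive and negative words
- Sentiment analysis with larger units: analyzing sentiment across larger units of text rather than individual words
This tutorial uses the harrypotter text data to illustrate text mining and analysis capabilities.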
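library(tidyverse)   # data manipulation and plotting
library(stringr)     # text cleaning and regular expressions
library(tidytext)    # provides additional text mining functions
library(harrypotter) # provides the text of the seven novels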
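The seven novels are available as the following objects:
philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
deathly_hallows: Harry Potter and the Deathly Hallows (2007)
Each text is a character vector in which each element represents a single chapter. For instance, philosophers_stone[1:2] shows the raw text of the first two chapters of philosophers_stone: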
## [1] "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank
## you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold
## with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly
## any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck,
## which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a
## small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also
## had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out
## about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn'... <truncated>
## [2] "THE VANISHING GLASS Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but
## Privet Drive had hardly changed at all. The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys'
## front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen
## that fateful news report about the owls. Only the photographs on the mantelpiece really showed how much time had passed. Ten years ago,
## there had been lots of pictures of what looked like a large pink beach ball wearing different-colored bonnets -- but Dudley Dursley was
## no longer a baby, and now the photographs showed a large blond boy riding his first bicycle, on a carousel at the fair, playing a
## computer game with his father, being hugged and kissed by his mother. The room held no sign at all that another boy lived in the house,
## too. Yet Harry Potter was still there, asleep at the moment, but no... <truncated>
A variety of dictionaries exist for evaluating the opinion or emotion in text. The tidytext package contains three sentiment lexicons in the sentiments dataset.
## # A tibble: 23,165 × 4
## word sentiment lexicon score
## <chr> <chr> <chr> <int>
## 1 abacus trust nrc NA
## 2 abandon fear nrc NA
## 3 abandon negative nrc NA
## 4 abandon sadness nrc NA
## 5 abandoned anger nrc NA
## 6 abandoned fear nrc NA
## 7 abandoned negative nrc NA
## 8 abandoned sadness nrc NA
## 9 abandonment anger nrc NA
## 10 abandonment fear nrc NA
## # ... with 23,155 more rows
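All three lexicons are based on unigrams (single words). They contain many English words, each assigned a score for positive/negative sentiment and, in some cases, for emotions such as joy, anger, and sadness. The nrc lexicon categorizes words in a binary fashion ("yes"/"no") into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.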
# view an individual lexicon
get_sentiments("bing")   # likewise "afinn" or "nrc"
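To make the difference between the three lexicon shapes concrete, here is a minimal sketch using toy stand-ins. The tibbles `toy_bing`, `toy_nrc`, and `toy_afinn` and every entry and score in them are hypothetical, chosen only to mirror the column layout shown above; they are not the real lexicon data.

```r
library(dplyr)

# Toy stand-ins for the three lexicon shapes (hypothetical entries, not real data)
toy_bing  <- tibble(word = c("good", "dark"),
                    sentiment = c("positive", "negative"))
toy_nrc   <- tibble(word = c("abandon", "abandon", "good"),
                    sentiment = c("fear", "negative", "positive"))
toy_afinn <- tibble(word = c("good", "dark", "death"),
                    score = c(3L, -1L, -2L))   # invented scores

# A scored lexicon can be collapsed to the binary form by the sign of its score
afinn_binary <- toy_afinn %>%
  mutate(sentiment = if_else(score > 0, "positive", "negative")) %>%
  select(word, sentiment)

afinn_binary
```

Note that in an nrc-style lexicon one word can appear in several rows, one per category, which matters later when joining and counting.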
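To perform sentiment analysis we need our data in a tidy format. The following converts all seven Harry Potter novels into a tibble that has each word by chapter by book. See the tidy text tutorial for more details.
# set factor to keep books in order of publication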
series$book <- factor(series$book, levels = rev(titles))
## # A tibble: 1,089,386 × 3
## book chapter word
## * <fctr> <int> <chr>
## 1 Philosopher's Stone 1 the
## 2 Philosopher's Stone 1 boy
## 3 Philosopher's Stone 1 who
## 4 Philosopher's Stone 1 lived
## 5 Philosopher's Stone 1 mr
## 6 Philosopher's Stone 1 and
## 7 Philosopher's Stone 1 mrs
## 8 Philosopher's Stone 1 dursley
## 9 Philosopher's Stone 1 of
## 10 Philosopher's Stone 1 number
## # ... with 1,089,376 more rows
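The conversion to one word per row, by chapter and by book, can be sketched on toy data as follows. The list `toy_books` and its contents are hypothetical stand-ins for the novel vectors, and the `lapply()` approach is one possible way to build the tibble, not necessarily the one used for the output above.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-ins for two books: one character element per chapter
toy_books <- list(
  "Book One" = c("The boy who lived.", "The glass vanished."),
  "Book Two" = c("A very bad birthday.")
)

toy_series <- bind_rows(lapply(names(toy_books), function(title) {
  tibble(chapter = seq_along(toy_books[[title]]),
         text    = toy_books[[title]]) %>%
    unnest_tokens(word, text) %>%   # one lowercase word per row, punctuation dropped
    mutate(book = title) %>%
    select(book, chapter, word)
}))

# keep the books in the order they were given
toy_series$book <- factor(toy_series$book, levels = names(toy_books))
toy_series
```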
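Now let's use the nrc sentiment data set to assess the different sentiments represented across the Harry Potter series. We can see that negative sentiment has a stronger presence than positive.
series %>%
  right_join(get_sentiments("nrc"), by = "word") %>%
  filter(!is.na(sentiment)) %>%
  count(sentiment, sort = TRUE)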
## # A tibble: 10 × 2
## sentiment n
## <chr> <int>
## 1 negative 56579
## 2 positive 38324
## 3 sadness 35866
## 4 anger 32750
## 5 trust 23485
## 6 fear 21544
## 7 anticipation 21123
## 8 joy 14298
## 9 disgust 13381
## 10 surprise 12991
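The join-and-count logic behind this table can be sketched on toy data. Both `toy_words` and `toy_nrc` are hypothetical, and this sketch counts word frequencies before joining (a design choice of mine that keeps each text word unique on the left side of the join), which differs slightly from the pipeline above but yields the same kind of per-sentiment totals.

```r
library(dplyr)

# Hypothetical tidy words and a toy nrc-style lexicon (entries invented)
toy_words <- tibble(word = c("dark", "death", "good", "dark", "wand"))
toy_nrc <- tibble(
  word      = c("dark", "death", "death", "good"),
  sentiment = c("sadness", "negative", "sadness", "positive")
)

toy_counts <- toy_words %>%
  count(word) %>%                        # frequency of each word in the text
  inner_join(toy_nrc, by = "word") %>%   # a word matches once per category it carries
  group_by(sentiment) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))

toy_counts
```

Because "death" carries two categories in the toy lexicon, it contributes to both totals, and "wand", which is not in the lexicon, drops out.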
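To see how sentiment changes over the course of each novel, we do the following:
- Create an index that breaks up each book into chunks of 500 words; this is roughly the number of words on two pages, so it lets us assess changes in sentiment even within chapters
- Join the bing lexicon with inner_join to assess the positive and negative sentiment of each word
- Count how many positive and negative words there are in every two pages
- Spread our data
- Calculate a net sentiment (positive - negative)
- Plot our data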
ggplot(aes(index, sentiment, fill = book)) +
geom_bar(alpha = 0.5, stat = "identity")
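The steps listed above can be sketched end to end on toy data. Everything here is an assumption made for illustration: the tibble `toy`, the lexicon `toy_bing`, and a chunk size of 4 words instead of 500. The index formula used here, `(row_number() - 1) %/% 4 + 1`, puts exactly four words in each chunk.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical tidy words for one short "book"; chunk size 4 instead of 500
toy <- tibble(book = "Toy Book",
              word = c("good", "dark", "great", "boy",
                       "death", "dark", "wand", "good"))
toy_bing <- tibble(word = c("good", "great", "dark", "death"),
                   sentiment = c("positive", "positive", "negative", "negative"))

net <- toy %>%
  group_by(book) %>%
  mutate(index = (row_number() - 1) %/% 4 + 1) %>%  # words 1-4 -> 1, words 5-8 -> 2
  ungroup() %>%
  inner_join(toy_bing, by = "word") %>%             # keep only words in the lexicon
  count(book, index, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%                # one column per sentiment
  mutate(sentiment = positive - negative)           # net sentiment per chunk

net

# one bar per chunk, as in the plot built above
p <- ggplot(net, aes(index, sentiment, fill = book)) +
  geom_col(alpha = 0.5, show.legend = FALSE)
```

Printing `p` draws the chart. The first chunk nets out positive and the second negative, which is the kind of within-book movement the index is meant to expose.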
summarise(sentiment = sum(score)) %>%
mutate(method = "AFINN")
bing_and_nrc <-
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative"))) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
We now have an estimate of the net sentiment (positive - negative) in the novel text for each sentiment lexicon. Let's plot them.
ggplot(aes(index, sentiment, fill = method)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_grid(book ~ method)
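The three lexicons give results that differ in an absolute sense but follow fairly similar relative trajectories through the novels. We see similar dips and peaks in sentiment at about the same places in each novel, but the absolute values are noticeably different. In some instances, the AFINN lexicon appears to find more positive sentiment than the NRC lexicon. This output also lets us compare across novels. First, you get a good sense of the differences in book length: Order of the Phoenix is much longer than Philosopher's Stone. Second, you can compare how the books in the series differ in their sentiment.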
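Why a scored and a binary lexicon can agree on direction yet disagree on magnitude is easy to see on a toy example. The data, the lexicons `toy_afinn` and `toy_bing`, and all scores below are invented for illustration. The `score` column name follows the output shown earlier; as far as I know, recent tidytext releases call this column `value` instead.

```r
library(dplyr)

# Net sentiment for the same toy chunks under a scored and a binary lexicon
toy <- tibble(index = c(1, 1, 1, 2, 2),
              word  = c("good", "great", "dark", "death", "good"))
toy_afinn <- tibble(word  = c("good", "great", "dark", "death"),
                    score = c(3, 3, -1, -2))   # invented scores
toy_bing  <- tibble(word = c("good", "great", "dark", "death"),
                    sentiment = c("positive", "positive", "negative", "negative"))

afinn_net <- toy %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(index) %>%
  summarise(sentiment = sum(score)) %>%        # sum of scores per chunk
  mutate(method = "AFINN")

bing_net <- toy %>%
  inner_join(toy_bing, by = "word") %>%
  group_by(index) %>%
  summarise(sentiment = sum(sentiment == "positive") -
                        sum(sentiment == "negative")) %>%  # positive minus negative counts
  mutate(method = "Bing")

both <- bind_rows(afinn_net, bing_net)
both
```

Both methods rank chunk 1 above chunk 2, but the scored lexicon produces larger absolute values because each word contributes its full score rather than a single count.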
## # A tibble: 3,313 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 like positive 2416
## 2 well positive 1969
## 3 right positive 1643
## 4 good positive 1065
## 5 dark negative 1034
## 6 great positive 877
## 7 death negative 757
## 8 magic positive 606
## 9 better positive 533
## 10 enough positive 509
## # ... with 3,303 more rows
We can view this visually to assess the top n words for each sentiment.
ggplot(aes(reorder(word, n), n, fill = sentiment)) +
geom_bar(alpha = 0.8, stat = "identity")
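Selecting the top n words within each sentiment can be sketched as below. The tibble `toy_word_counts` is a hypothetical object holding the first eight rows of the word counts printed above, and `slice_max()` (dplyr 1.0 or later) is my choice here; it is only one of several ways to pick the top rows per group.

```r
library(dplyr)

# First eight rows of the word-count output shown above
toy_word_counts <- tibble(
  word      = c("like", "well", "right", "good", "dark", "great", "death", "magic"),
  sentiment = c("positive", "positive", "positive", "positive",
                "negative", "positive", "negative", "positive"),
  n         = c(2416, 1969, 1643, 1065, 1034, 877, 757, 606)
)

top_words <- toy_word_counts %>%
  group_by(sentiment) %>%
  slice_max(order_by = n, n = 2) %>%   # two most frequent words per sentiment
  ungroup()

top_words
```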
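Some sentiment analysis algorithms look beyond single words to the sentence as a whole, so that a sentence in which a positive word is negated is read as a sad sentence, not a happy one, because of the negation. Stanford's CoreNLP tools are an example of such sentiment analysis algorithms. For these, we may want to tokenize text into sentences. I use the philosophers_stone data set to illustrate.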
tibble(text = philosophers_stone) %>%
  unnest_tokens(sentence, text, token = "sentences")
## sentence
## <chr>
## 1 the boy who lived mr. and mrs.
## 2 dursley, of number four, privet drive, were proud to say that they were per
## 3 they were the last people you'd expect to be involved in anything strange o
## 4 mr.
## 5 dursley was the director of a firm called grunnings, which made drills.
## 6 he was a big, beefy man with hardly any neck, although he did have a very l
## 7 mrs.
## 8 dursley was thin and blonde and had nearly twice the usual amount of neck,
## 9 the dursleys had a small son called dudley and in their opinion there was n
## 10 the dursleys had everything they wanted, but they also had a secret, and th
## # ... with 6,588 more rows
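The argument token = "sentences" attempts to break up text by punctuation. As the output above shows, this also splits after abbreviations such as "mr." and "mrs.".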
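A minimal sketch of sentence tokenization on a toy string follows; `toy_text` is a hypothetical input, and the example deliberately avoids abbreviations so the split points are unambiguous.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row input with three clearly delimited sentences
toy_text <- tibble(text = "The boy lived. He was happy! Was it strange?")

toy_sentences <- toy_text %>%
  unnest_tokens(sentence, text, token = "sentences")  # one sentence per row, lowercased

toy_sentences
```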
Let's proceed to break up the philosophers_stone text by chapter and sentence.
tibble(chapter = seq_along(philosophers_stone),
       text = philosophers_stone) %>%
  unnest_tokens(sentence, text, token = "sentences")
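This lets us assess net sentiment by chapter and by sentence. First, we need to track the sentence numbers, and then I create an index that tracks progress through each chapter. I then unnest the sentences by word. This gives us a tibble that has the individual words by sentence within each chapter. Now, as before, I join the AFINN lexicon and compute the net sentiment score for each sentence within each chapter. We can see that the most positive sentences are halfway through chapter 9, towards the end of chapter 17, early in chapter 4, and so on.
group_by(chapter, index) %>%
summarise(sentiment = sum(score, na.rm = TRUE)) %>%
arrange(desc(sentiment))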
## Source: local data frame [1,401 x 3]
## Groups: chapter [17]
## chapter index sentiment
## <int> <dbl> <int>
## 1 9 0.47 14
## 2 17 0.91 13
## 3 4 0.11 12
## 4 12 0.45 12
## 5 17 0.54 12
## 6 1 0.25 11
## 7 10 0.04 11
## 8 10 0.16 11
## 9 11 0.48 11
## 10 12 0.70 11
## # ... with 1,391 more rows
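The sentence-level scoring described above can be sketched end to end on toy data. The vector `toy_chapters`, the lexicon `toy_afinn`, and its scores are all hypothetical; the `score` column name follows the output above (recent tidytext releases, as far as I know, name it `value`).

```r
library(dplyr)
library(tidytext)

# Hypothetical two-chapter text and a toy scored lexicon (invented scores)
toy_chapters <- c("A good day. A dark night. Great joy.",
                  "Death came. All was well.")
toy_afinn <- tibble(word  = c("good", "dark", "great", "joy", "death"),
                    score = c(3, -2, 3, 3, -2))

sent_scores <- tibble(chapter = seq_along(toy_chapters),
                      text    = toy_chapters) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  group_by(chapter) %>%
  mutate(sentence_num = row_number(),
         index = round(sentence_num / n(), 2)) %>%  # progress through the chapter (0 to 1)
  ungroup() %>%
  unnest_tokens(word, sentence) %>%                 # words, still tagged by sentence index
  inner_join(toy_afinn, by = "word") %>%
  group_by(chapter, index) %>%
  summarise(sentiment = sum(score)) %>%             # net score per sentence
  ungroup() %>%
  arrange(desc(sentiment))

sent_scores
```

Sentences with no lexicon words, such as the last sentence of the second toy chapter, drop out at the join, so the result can have fewer rows than there are sentences.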
ggplot(book_sent, aes(index, factor(chapter), fill = sentiment)) +
  geom_tile(color = "white") +
  scale_fill_gradient2()
This article is excerpted from "Text Mining, Sentiment Analysis, and Visualization of Harry Potter Novel Text Data in R".
