On data mining: text mining, sentiment analysis, and visualization of the Harry Potter novels' text data in R (with code and data)
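Full-text download link: http://tecdat.cn/?p=22984

We were recently asked by a client to write a research report on text mining, including some graphical and statistical output.

Once we have cleaned up our text and performed some basic word-frequency analysis, the next step is to understand the opinion or emotion in the text. This is known as sentiment analysis, and this tutorial walks you through a simple approach to it.

In short, this tutorial is an introduction to sentiment analysis. It builds on the tidy text tutorial, so if you have not read that tutorial, I suggest you start there. This tutorial covers the following:

Requirements: what do you need to reproduce the analysis in this tutorial?
Sentiment data sets: the primary data sets used to score sentiment
Basic sentiment analysis: performing basic sentiment analysis
Comparing sentiments: comparing how sentiments differ across the sentiment lexicons
Common sentiment words: finding the most common positive and negative words
Sentiment analysis with larger units: analyzing sentiment across larger units of text rather than individual words

Replication requirements

This tutorial uses the harrypotter text data to illustrate text mining and analysis capabilities.

    library(tidyverse)    # data manipulation and plotting
    library(stringr)      # text cleaning and regular expressions
    library(tidytext)     # provides additional text mining functions
    library(harrypotter)  # provides the text of the seven novels

The seven novels we are working with are:

philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
deathly_hallows: Harry Potter and the Deathly Hallows (2007)

Each text is stored in a character vector, with each element representing a single chapter. For example, the following shows the raw text of the first two chapters of philosophers_stone.

    philosophers_stone[1:2]
    ## [1] "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank
    ## you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold
    ## with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly
    ## any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck,
    ## which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a
    ## small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also
    ## had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out
    ## about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn'...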
    <truncated>
    ## [2] "THE VANISHING GLASS Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but
    ## Privet Drive had hardly changed at all. The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys'
    ## front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen
    ## that fateful news report about the owls. Only the photographs on the mantelpiece really showed how much time had passed. Ten years ago,
    ## there had been lots of pictures of what looked like a large pink beach ball wearing different-colored bonnets -- but Dudley Dursley was
    ## no longer a baby, and now the photographs showed a large blond boy riding his first bicycle, on a carousel at the fair, playing a
    ## computer game with his father, being hugged and kissed by his mother. The room held no sign at all that another boy lived in the house,
    ## too. Yet Harry Potter was still there, asleep at the moment, but no... <truncated>

Sentiment data sets

A variety of dictionaries exist for evaluating the opinion or emotion in text. The tidytext package contains three sentiment lexicons in its sentiments data set.

    sentiments
    ## # A tibble: 23,165 × 4
    ##           word sentiment lexicon score
    ##          <chr>     <chr>   <chr> <int>
    ## 1       abacus     trust     nrc    NA
    ## 2      abandon      fear     nrc    NA
    ## 3      abandon  negative     nrc    NA
    ## 4      abandon   sadness     nrc    NA
    ## 5    abandoned     anger     nrc    NA
    ## 6    abandoned      fear     nrc    NA
    ## 7    abandoned  negative     nrc    NA
    ## 8    abandoned   sadness     nrc    NA
    ## 9  abandonment     anger     nrc    NA
    ## 10 abandonment      fear     nrc    NA
    ## # ...
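    with 23,155 more rows

The three lexicons are:

AFINN
bing
nrc

All three lexicons are based on unigrams (single words). They contain many English words, each assigned a score for positive/negative sentiment and, in some cases, for emotions such as joy, anger, and sadness. The nrc lexicon categorizes words in a binary fashion ("yes"/"no") into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

    # view an individual lexicon
    get_sentiments("afinn")
    get_sentiments("bing")
    get_sentiments("nrc")

Basic sentiment analysis

To perform sentiment analysis we need our data in a tidy format. The following converts all seven Harry Potter novels into a tibble in which each word is arranged by chapter and by book. See the tidy text tutorial for more details.

    # set factor to keep books in order of publication
    series$book <- factor(series$book, levels = rev(titles))
    series
    ## # A tibble: 1,089,386 × 3
    ##                   book chapter    word
    ## *               <fctr>   <int>   <chr>
    ## 1  Philosopher's Stone       1     the
    ## 2  Philosopher's Stone       1     boy
    ## 3  Philosopher's Stone       1     who
    ## 4  Philosopher's Stone       1   lived
    ## 5  Philosopher's Stone       1      mr
    ## 6  Philosopher's Stone       1     and
    ## 7  Philosopher's Stone       1     mrs
    ## 8  Philosopher's Stone       1 dursley
    ## 9  Philosopher's Stone       1      of
    ## 10 Philosopher's Stone       1  number
    ## # ... with 1,089,376 more rows

Now let's use the nrc sentiment data set to assess the different sentiments represented across the Harry Potter series. We can see that negative sentiment has a stronger presence than positive sentiment: 56,579 word matches are tagged negative versus 38,324 tagged positive.

    # the first two lines are assumed; the source shows only the filter and count steps
    series %>%
      right_join(get_sentiments("nrc")) %>%
      filter(!is.na(sentiment)) %>%
      count(sentiment, sort = TRUE)
    ## # A tibble: 10 × 2
    ##       sentiment     n
    ##           <chr> <int>
    ## 1      negative 56579
    ## 2      positive 38324
    ## 3       sadness 35866
    ## 4         anger 32750
    ## 5         trust 23485
    ## 6          fear 21544
    ## 7  anticipation 21123
    ## 8           joy 14298
    ## 9       disgust 13381
    ## 10     surprise 12991

This gives a good overall sense, but what if we want to understand how sentiment changes over the course of each novel? To do this we perform the following steps:

Create an index that breaks each book up into blocks of 500 words; this is roughly the number of words on two pages, so it lets us assess changes in sentiment even within chapters.
Join the bing lexicon with inner_join to assess the positive or negative sentiment of each word.
Count how many positive and negative words there are in every two pages.
Spread our data and calculate a net sentiment (positive - negative).
Plot our data.

    # plotting step, applied to the data produced by the steps above
    ggplot(aes(index, sentiment, fill = book)) +
      geom_bar(alpha = 0.5, stat = "identity")

Now we can see how the plot of each novel shifts toward more positive or more negative sentiment over the trajectory of the story.

Comparing sentiments

With several sentiment lexicons to choose from, you may want more information on which one is appropriate for your purposes. Let's use all three sentiment lexicons and examine how they differ for each novel.

    summarise(sentiment = sum(score)) %>%
      mutate(method = "AFINN")

    bing_and_nrc <- inner_join(get_sentiments("nrc") %>%
                                 filter(sentiment %in% c("positive", "negative"))) %>%
      spread(sentiment, n, fill = 0) %>%

We now have an estimate of the net sentiment (positive - negative) in the novel text for each sentiment lexicon. Let's plot them.

    ggplot(aes(index, sentiment, fill = method)) +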
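      geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
      facet_grid(book ~ method)

The three lexicons give results that differ in an absolute sense but follow fairly similar relative trajectories through the novels. We see similar dips and peaks in sentiment at about the same places in each novel, but the absolute values are noticeably different. In some instances, the AFINN lexicon appears to find more positive sentiment than the NRC lexicon. This output also lets us compare across novels. First, you get a good sense of the differences in book length: Order of the Phoenix is much longer than Philosopher's Stone. Second, you can compare how the books in the series differ in their sentiment.

Common sentiment words

One advantage of having a data frame with both sentiment and word is that we can analyze the word counts that contribute to each sentiment.

    word_counts
    ## # A tibble: 3,313 × 3
    ##      word sentiment     n
    ##     <chr>     <chr> <int>
    ## 1    like  positive  2416
    ## 2    well  positive  1969
    ## 3   right  positive  1643
    ## 4    good  positive  1065
    ## 5    dark  negative  1034
    ## 6   great  positive   877
    ## 7   death  negative   757
    ## 8   magic  positive   606
    ## 9  better  positive   533
    ## 10 enough  positive   509
    ## # ... with 3,303 more rows

We can view this visually to assess the top n words for each sentiment. ...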
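A minimal sketch of such a top-n view, under stated assumptions: the `toy_words` tibble is a hypothetical stand-in for the tokenized novels, the cut-off of two words per sentiment is arbitrary, and the bing lexicon is taken from `get_sentiments("bing")` as bundled with tidytext. It is meant to illustrate the idea, not to reproduce the tutorial's own plot.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Hypothetical toy data standing in for the one-word-per-row `series` tibble.
toy_words <- tibble(
  word = c("like", "dark", "good", "death", "like",
           "good", "like", "dark", "the", "great")
)

# Attach a bing sentiment label to each matching word, then count occurrences.
word_counts <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

# Keep the most frequent words within each sentiment (here: the top 2).
top_n_words <- word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%
  ungroup()

# One panel per sentiment, with bars ordered by frequency.
ggplot(top_n_words, aes(reorder(word, n), n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  coord_flip() +
  labs(x = NULL, y = "Number of occurrences")
```

With the real data, `toy_words` would be replaced by the `series` tibble and the cut-off raised to something like 10; `with_ties = FALSE` keeps exactly n rows per sentiment even when counts are tied.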