Artificial Intelligence: Text Data Analysis and Visualization of Reuters Articles

Author | Manmohan Singh
Compiled by | VK
Source | Towards Data Science

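When I ask you to interpret text data, what do you do? What steps would you take to build a text visualization?

This article will help you gather the information you need to build visualizations and interpret text data.

Insights drawn from text data help us discover connections between articles and detect trends and patterns. Analyzing text data filters out noise and uncovers previously unknown information.

This process is also known as exploratory text analysis (ETA). We analyze the text data with methods such as K-Means, TF-IDF, and word frequency. ETA is also useful during data cleaning.

We also use the Matplotlib, seaborn, and Plotly libraries to visualize the results as charts, word clouds, and plots.

Before analyzing the text data, complete these preprocessing tasks.

There is plenty of unstructured text data available for analysis. You can get data from the following sources.

Twitter text datasets from Kaggle.
Reddit and Twitter datasets, retrieved through their APIs.
Articles scraped from websites with BeautifulSoup.

I will use Reuters articles in SGML format. To make the analysis easier, I will use the BeautifulSoup library to extract the date, title, and body of each article from the data files.

Use the code below to extract the data from all the data files and store the output in a single CSV file.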

from bs4 import BeautifulSoup
import pandas as pd
import csv

article_dict = {}
i = 0
list_of_data_num = []

for j in range(0,22):
    if j < 10:
        list_of_data_num.append("00" + str(j))
    else:
        list_of_data_num.append("0" + str(j))

# Loop over all articles to extract the date, title, and article body
for num in list_of_data_num:
    try:
        soup = BeautifulSoup(open("data/reut2-" + num + ".sgm"), features='lxml')
    except:
        continue
    print(num)
    data_reuters = soup.find_all('reuters')
    for data in data_reuters:
        article_dict[i] = {}
        for date in data.find_all('date'):
            try:
                article_dict[i]["date"] = str(date.contents[0]).strip()
            except:
                article_dict[i]["date"] = None
            # print(date.contents[0])
        for title in data.find_all('title'):
            article_dict[i]["title"] = str(title.contents[0]).strip()
            # print(title.contents)
        for text in data.find_all('text'):
            try:
                article_dict[i]["text"] = str(text.contents[4]).strip()
            except:
                article_dict[i]["text"] = None
        i += 1


dataframe_article = pd.DataFrame(article_dict).T
dataframe_article.to_csv('articles_data.csv', header=True, index=False, quoting=csv.QUOTE_ALL)
print(dataframe_article)

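You can also use the re and os libraries to combine or loop over all the data files.
Each article is enclosed in a <REUTERS> element, so we use find_all('reuters').
You can also use the pickle module to save the data instead of CSV.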
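
The first and third notes above can be sketched together. This is a minimal illustration rather than the author's code: the helper names `list_sgm_files`, `save_articles`, and `load_articles` are hypothetical, and only the `reut2-NNN.sgm` naming pattern is taken from the extraction loop.

```python
import os
import pickle
import re

# Matches the Reuters file names used above, e.g. reut2-000.sgm ... reut2-021.sgm
SGM_PATTERN = re.compile(r"^reut2-\d{3}\.sgm$")


def list_sgm_files(data_dir):
    """Return the sorted paths of all reut2-NNN.sgm files in data_dir."""
    names = sorted(n for n in os.listdir(data_dir) if SGM_PATTERN.match(n))
    return [os.path.join(data_dir, n) for n in names]


def save_articles(article_dict, path):
    """Persist the parsed articles with pickle instead of CSV."""
    with open(path, "wb") as fh:
        pickle.dump(article_dict, fh)


def load_articles(path):
    """Load articles previously written by save_articles."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Unlike CSV, pickle keeps Python values such as None intact when the data is reloaded. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.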

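In this section, we remove noise such as null values, punctuation, and numbers from the text data. First, we drop the rows whose text column is null. Then we deal with the null values in the other columns.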

import re
import string

import pandas as pd

articles_data = pd.read_csv('articles_data.csv')
# Count the null values in each column
print(articles_data.apply(lambda x: sum(x.isnull())))
articles_nonNull = articles_data.dropna(subset=['text']).copy()
articles_nonNull.reset_index(inplace=True)

def clean_text(text):
    '''Make text lowercase, remove text in angle brackets, remove line breaks,
    remove punctuation and remove words containing numbers.'''
    text = str(text).lower()
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Replace line breaks with a space so words on adjacent lines are not merged
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

articles_nonNull['text_clean'] = articles_nonNull['text']\
                                  .apply(lambda x: clean_text(x))

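When we drop the null values in the text column, the null values in the other columns disappear as well.
We use the re module to remove noise from the text data.

The steps taken during data cleaning may grow or shrink depending on the text data. So study your text data carefully and build your clean_text() method accordingly.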
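
To see what such a cleaning function does to a single string, here is a small self-contained variant of the steps described above. The sample sentence is made up, and two details are assumptions of this sketch rather than part of the original method: line breaks are replaced with a space, and repeated whitespace is collapsed at the end.

```python
import re
import string


def clean_text(text):
    """Lowercase, then strip tags, punctuation, line breaks and words with digits."""
    text = str(text).lower()
    text = re.sub(r"<.*?>+", "", text)                               # tags such as <b>
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\n", " ", text)                                  # line breaks
    text = re.sub(r"\w*\d\w*", "", text)                             # tokens with digits
    return " ".join(text.split())                                    # collapse whitespace


sample = "Shares of <b>ACME</b> rose 5 pct\nin 1987, traders said."
print(clean_text(sample))
```

Running a few representative strings through the function like this is a quick way to check whether a step removes too much or too little for your own data.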

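With preprocessing complete, we move on to analyzing the text data.

Let's start the analysis.

We know the articles are not all the same length, so we will only consider articles that are at least one paragraph long. According to research, the average sentence is 15-20 words long, and a paragraph should have four sentences. That puts the minimum length at about 4 × 15 = 60 words.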

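import matplotlib.pyplot as plt
import seaborn as sns

# Number of whitespace-separated words in each article
articles_nonNull['word_length'] = articles_nonNull['text'].apply(lambda x: len(str(x).split()))
print(articles_nonNull.describe())

# Keep articles that are at least one paragraph (about 60 words) long
articles_word_limit = articles_nonNull[articles_nonNull['word_length'] >= 60].copy()

plt.figure(figsize=(12,6))
# fill=True replaces shade=True, which is deprecated in recent seaborn releases
p1 = sns.kdeplot(articles_word_limit['word_length'], fill=True, color="r").set_title('Kernel Distribution of Number Of words')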

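I removed the articles that are shorter than 60 words.
The word-count distribution is right-skewed.
Most articles are around 150 words long.
Reuters articles that contain facts or stock information use fewer words.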
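
A quick numeric check of the skew, without a plot, is to compare the mean with the median. The sketch below uses only the standard library; the word counts are hypothetical values chosen for illustration, not figures from the Reuters data.

```python
import statistics

# Hypothetical word counts for a handful of articles (not the Reuters data)
word_lengths = [45, 70, 95, 120, 150, 150, 160, 210, 400, 900]

# Keep articles with at least one paragraph: 4 sentences x 15 words
MIN_WORDS = 4 * 15
kept = [n for n in word_lengths if n >= MIN_WORDS]

mean_len = statistics.mean(kept)
median_len = statistics.median(kept)

# A mean well above the median is a quick sign of a right-skewed distribution
print(f"kept={len(kept)} mean={mean_len:.1f} median={median_len}")
print("right-skewed" if mean_len > median_len else "not right-skewed")
```

A few very long articles pull the mean upward while the median stays near the typical length, which is the long right tail seen in the density plot.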

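In this part, we count the words that appear in the articles and analyze the results. We analyze the word counts with the N-gram method. An N-gram is a run of N consecutive words: N=1 gives single words (unigrams), N=2 gives word pairs (bigrams), and N=3 gives word triples (trigrams).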
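
As a small worked example of the idea, the sketch below builds unigrams, bigrams, and trigrams from a short token list and counts them. The token list is made up for illustration; the function follows the common "zip over shifted copies" pattern.

```python
from collections import Counter


def generate_ngrams(tokens, n_gram=1):
    """Return every run of n_gram consecutive tokens, joined by spaces."""
    # zip over n_gram shifted copies of the list: tokens[0:], tokens[1:], ...
    shifted = [tokens[i:] for i in range(n_gram)]
    return [" ".join(gram) for gram in zip(*shifted)]


tokens = ["share", "prices", "rose", "as", "share", "prices", "recovered"]

unigrams = Counter(generate_ngrams(tokens, 1))
bigrams = Counter(generate_ngrams(tokens, 2))
trigrams = Counter(generate_ngrams(tokens, 3))

# The most frequent bigram and how often it occurs
print(bigrams.most_common(1))
```

Because zip stops at the shortest of the shifted lists, a list of 7 tokens yields 7 unigrams, 6 bigrams, and 5 trigrams.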

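We will remove stop words from the text data, because stop words are noise and of little use in the analysis.

Let's plot the unigrams in a bar chart and draw a word cloud for them.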

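from collections import defaultdict
import random

import gensim
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from gensim.parsing.preprocessing import remove_stopwords
from wordcloud import WordCloud

# Load gensim's built-in stop words into stopwords_list
# You can also add stop words manually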
gensim_stopwords = gensim.parsing.preprocessing.STOPWORDS               
stopwords_list = list(set(gensim_stopwords))                               
stopwords_update = ["mln", "vs","cts","said","billion","pct","dlrs","dlr"]                      
stopwords = stopwords_list + stopwords_update
articles_word_limit['temp_list'] = articles_word_limit['text_clean'].apply(lambda x:str(x).split())

# Remove stop words from the articles
def remove_stopword(x):
    return [word for word in x if word not in stopwords]
articles_word_limit['temp_list_stopw'] = articles_word_limit['temp_list'].apply(lambda x:remove_stopword(x))

# Generate the n-grams of a token list
def generate_ngrams(text, n_gram=1):
    ngrams = zip(*[text[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]
  
  
article_unigrams = defaultdict(int)
for tweet in articles_word_limit['temp_list_stopw']:
    for word in generate_ngrams(tweet):
        article_unigrams[word] += 1
        
article_unigrams_df = pd.DataFrame(sorted(article_unigrams.items(), key=lambda x: x[1])[::-1])
N=50

# Top 50 most common unigrams in the Reuters articles
fig, axes = plt.subplots(figsize=(18, 50))
plt.tight_layout()
sns.barplot(y=article_unigrams_df[0].values[:N], x=article_unigrams_df[1].values[:N], color='red')
axes.spines['right'].set_visible(False)
axes.set_xlabel('')
axes.set_ylabel('')
axes.tick_params(axis='x', labelsize=13)
axes.tick_params(axis='y', labelsize=13)
axes.set_title(f'Top {N} most common unigrams in Reuters Articles', fontsize=15)
plt.show()


# Draw the word cloud
def col_func(word, font_size, position, orientation, font_path, random_state):
    colors = ['#b58900', '#cb4b16', '#dc322f', '#d33682', '#6c71c4',
              '#268bd2', '#2aa198', '#859900']
    return random.choice(colors)
fd = {
    'fontsize': '32',
    'fontweight' : 'normal',
    'verticalalignment': 'baseline',
    'horizontalalignment': 'center',
}
wc = WordCloud(width=2000, height=1000, collocations=False,
               background_color="white",
               color_func=col_func,
               max_words=200,
               random_state=np.random.randint(1, 8)) .generate_from_frequencies(article_unigrams)
fig, ax = plt.subplots(figsize=(20,10))
ax.imshow(wc, interpolation='bilinear')
ax.axis("off")
ax.set_title('Unigram Words of Reuters Articles', pad=24, fontdict=fd)
plt.show()

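Share, trade, and stock are some of the most common words. They come from articles about the stock market and the financial industry.

So we can say that most Reuters articles fall into the finance and stock categories.

Let's draw a bar chart and a word cloud for the bigrams.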

article_bigrams = defaultdict(int)
for tweet in articles_word_limit['temp_list_stopw']:
    for word in generate_ngrams(tweet, n_gram=2):
        article_bigrams[word] += 1

df_article_bigrams = pd.DataFrame(sorted(article_bigrams.items(),
                                  key=lambda x: x[1])[::-1])

N=50

# Bar chart of the top 50 bigrams
fig, axes = plt.subplots(figsize=(18, 50), dpi=100)
plt.tight_layout()
sns.barplot(y=df_article_bigrams[0].values[:N],
            x=df_article_bigrams[1].values[:N],
            color='red')
axes.spines['right'].set_visible(False)
axes.set_xlabel('')
axes.set_ylabel('')
axes.tick_params(axis='x', labelsize=13)
axes.tick_params(axis='y', labelsize=13)
axes.set_title(f'Top {N} most common Bigrams in Reuters Articles',
               fontsize=15)
plt.show()

# Word cloud
wc = WordCloud(width=2000, height=1000, collocations=False,
               background_color="white",
               color_func=col_func,
               max_words=200,
               random_state=np.random.randint(1,8))\
               .generate_from_frequencies(article_bigrams)

fig, ax = plt.subplots(figsize=(20,10))
ax.imshow(wc, interpolation='bilinear')
ax.axis("off")
# Title corrected: this cloud shows bigrams, not trigrams
ax.set_title('Bigram Words of Reuters Articles', pad=24,
             fontdict=fd)
plt.show()

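Bigrams provide more textual information and context than unigrams. For example, "share loss" hints that many of the articles report losses on stocks.

Let's draw a bar chart and a word cloud for the trigrams.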

article_trigrams = defaultdict(int)
for tweet in articles_word_limit['temp_list_stopw']:
    for word in generate_ngrams(tweet, n_gram=3):
        article_trigrams[word] += 1
df_article_trigrams = pd.DataFrame(sorted(article_trigrams.items(),
                                   key=lambda x: x[1])[::-1])

N=50

# Bar chart of the top 50 trigrams
fig, axes = plt.subplots(figsize=(18, 50), dpi=100)
plt.tight_layout()
sns.barplot(y=df_article_trigrams[0].values[:N],
            x=df_article_trigrams[1].values[:N],
            color='red')
axes.spines['right'].set_visible(False)
axes.set_xlabel('')
axes.set_ylabel('')
axes.tick_params(axis='x', labelsize=13)
axes.tick_params(axis='y', labelsize=13)
axes.set_title(f'Top {N} most common Trigrams in Reuters articles',
               fontsize=15)
plt.show()

# Word cloud
wc = WordCloud(width=2000, height=1000, collocations=False,
               background_color="white",
               color_func=col_func,
               max_words=200,
               random_state=np.random.randint(1,8))\
               .generate_from_frequencies(article_trigrams)
fig, ax = plt.subplots(figsize=(20,10))
ax.imshow(wc, interpolation='bilinear')
ax.axis("off")
ax.set_title('Trigrams Words of Reuters Articles', pad=24,
             fontdict=fd)
plt.show()

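Most of the trigrams resemble the bigrams and provide no additional information, so we end this part here.

Named entity recognition (NER) is the process of extracting specific information from text data. With the help of NER, we extract location, person name, date, quantity, and organization entities from the text. We use the spaCy Python library for this task.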

from collections import Counter

import matplotlib.pyplot as plt
import spacy
from matplotlib import cm

nlp = spacy.load('en_core_web_sm')
# Flat lists that collect every entity across all articles
location = []
person = []
date = []
quantity = []
organisation = []

def ner_text(text):
    '''Return the entities found in one article, grouped by type.'''
    doc = nlp(text)
    ner_collection = {"Location":[],"Person":[],"Date":[],"Quantity":[],"Organisation":[]}
    for ent in doc.ents:
        if str(ent.label_) == "GPE":
            ner_collection['Location'].append(ent.text)
            location.append(ent.text)
        elif str(ent.label_) == "DATE":
            ner_collection['Date'].append(ent.text)
            date.append(ent.text)
        elif str(ent.label_) == "PERSON":
            ner_collection['Person'].append(ent.text)
            person.append(ent.text)
        elif str(ent.label_) == "ORG":
            ner_collection['Organisation'].append(ent.text)
            organisation.append(ent.text)
        elif str(ent.label_) == "QUANTITY":
            ner_collection['Quantity'].append(ent.text)
            quantity.append(ent.text)
        else:
            continue
    return ner_collection

articles_word_limit['ner_data'] = articles_word_limit['text'].map(lambda x: ner_text(x))

# Count how often each location is mentioned and keep the top 10
location_counts = Counter(location)
location_name = []
location_count = []
for i in location_counts.most_common()[:10]:
    location_name.append(i[0].upper())
    location_count.append(i[1])


fig, ax = plt.subplots(figsize=(15, 8), dpi=100)
ax.barh(location_name, location_count, alpha=0.7,
         # width = 0.5,
        color=cm.Blues([i / 0.00525 for i in [ 0.00208, 0.00235, 0.00281, 0.00317, 0.00362,
                                              0.00371, 0.00525, 0.00679, 0.00761, 0.00833]])
        )
plt.rcParams.update({'font.size': 10})
rects = ax.patches
for i, label in enumerate(location_count):
    ax.text(label+100 , i, str(label), size=10, ha='center', va='center')
ax.text(0, 1.02, 'Count of Location name Extracted from Reuters Articles', 
        transform=ax.transAxes, size=12, weight=600, color='#777777')
ax.xaxis.set_ticks_position('bottom')
ax.tick_params(axis='y', colors='black', labelsize=12)
ax.set_axisbelow(True)
ax.text(0, 1.08, 'TOP 10 Location Mention in Reuters Articles',
        transform=ax.transAxes, size=22, weight=600, ha='left')
ax.text(0, -0.1, 'Source: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html',
        transform=ax.transAxes, size=12, weight=600, color='#777777')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
plt.tick_params(axis='y',which='both', left=False, top=False, labelbottom=False)
ax.set_xticks([])
plt.show()

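From this chart, you can see that most articles contain news from the USA, Japan, Canada, London, and China.
The high number of mentions of the USA reflects Reuters' business focus on the United States.
The person variable shows who was prominent in 1987. This information helps us learn more about those people.
The organisation variable contains the most-mentioned organizations in the world.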
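
The same counting used for locations applies to the person and organisation lists. The sketch below is a stand-alone illustration: the (text, label) pairs are hypothetical stand-ins for spaCy's entities, and the `LABEL_TO_BUCKET` table is an assumption of this sketch, introduced so that each label can only ever feed its own bucket.

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for spaCy's doc.ents
entities = [
    ("Smith", "PERSON"), ("Japan", "GPE"), ("Smith", "PERSON"),
    ("Acme Corp", "ORG"), ("Jones", "PERSON"), ("Acme Corp", "ORG"),
    ("Acme Corp", "ORG"), ("last week", "DATE"), ("third", "ORDINAL"),
]

# One lookup table from label to bucket keeps labels and counters in sync
LABEL_TO_BUCKET = {"GPE": "Location", "PERSON": "Person", "DATE": "Date",
                   "QUANTITY": "Quantity", "ORG": "Organisation"}

counts = {bucket: Counter() for bucket in LABEL_TO_BUCKET.values()}
for text, label in entities:
    bucket = LABEL_TO_BUCKET.get(label)
    if bucket is not None:  # ignore labels we do not track
        counts[bucket][text] += 1

top_people = counts["Person"].most_common(2)
top_orgs = counts["Organisation"].most_common(1)
print(top_people, top_orgs)
```

With real data, the resulting name and count pairs can be passed to the same horizontal bar chart code that is used for locations.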

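We will find the unique words in the articles using TF-IDF. Term frequency (TF) is the number of times a word appears in an article. Inverse document frequency (IDF) looks at all the articles together and measures how important a word is.

Words with a high TF-IDF score occur frequently in one article and rarely or never in the other articles.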
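
To make the scoring concrete, here is a hand-rolled sketch on a toy corpus. It uses the textbook weighting, TF divided by article length times log(N / DF), which is an assumption of this example; scikit-learn's TfidfVectorizer smooths the IDF and normalizes each row by default, so its numbers differ even though the ranking idea is the same. The three "articles" are made up.

```python
import math
from collections import Counter

# Toy corpus of pre-tokenised "articles" (hypothetical, not Reuters data)
docs = [
    ["oil", "prices", "rose", "oil"],
    ["stock", "prices", "fell"],
    ["bank", "raised", "prices"],
]


def tf_idf(docs):
    """Return one {word: score} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (c / len(doc)) * math.log(n_docs / df[w])
                       for w, c in tf.items()})
    return scores


scores = tf_idf(docs)
# "oil" is frequent in the first article and absent elsewhere, so it scores
# highest there; "prices" appears in every article, so its score is 0.
print(scores[0])
```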

Let's compute the TF-IDF scores and find the unique words.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(articles_word_limit['text_clean'])
# Dense array; zeros become NaN so that they are ignored in the mean
tfidf = tfidf_vectorizer_vectors.toarray()
tfidf[tfidf == 0] = np.nan

# Use numpy's nanmean to ignore NaN when computing the mean
means = np.nanmean(tfidf, axis=0)

# Convert to a dictionary for later lookup
# get_feature_names_out() replaces get_feature_names() in current scikit-learn
means_words = dict(zip(tfidf_vectorizer.get_feature_names_out(),
                       means.tolist()))
unique_words = sorted(means_words.items(),
                      key=lambda x: x[1],
                      reverse=True)
print(unique_words)

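K-Means is an unsupervised machine learning algorithm. It helps us gather articles of the same type into one group. We set the number of groups, or clusters, by choosing the value of k. For reference, I chose k = 4.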

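from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
X = vectorizer.fit_transform(articles_word_limit['text_clean'])
k = 4
model = KMeans(n_clusters=k, init='k-means++',
               max_iter=100, n_init=1)
model.fit(X)
# Term indices of each centroid, sorted from highest to lowest weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names_out() replaces get_feature_names() in current scikit-learn
terms = vectorizer.get_feature_names_out()
clusters = model.labels_.tolist()
# Use the cluster label as the index so articles can be selected per cluster
articles_word_limit.index = clusters
for i in range(k):
    # Top terms of cluster i, read from order_centroids and terms
    print("Cluster %d words:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s,' % terms[ind], end='')
    print()
    # Articles in cluster i; .loc replaces .ix, which was removed from pandas
    for title in articles_word_limit.loc[[i], ['text_clean', 'index']].values.tolist():
        print('%s,' % title, end='')
    print()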

It helps us sort the articles into different groups such as sports, currency, finance, and so on. The accuracy of K-Means is generally low.
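
For readers who want to see what the algorithm does internally, below is a plain-Python sketch of the assign and update loop (Lloyd's algorithm) on 2-D points. It is an illustration only, not scikit-learn's implementation: the starting centroids are simply the first k points, whereas the code above uses k-means++ initialization, and the points are toy data rather than TF-IDF vectors.

```python
def kmeans(points, k, n_iter=10):
    """Plain Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    # Assumption: start from the first k points (scikit-learn uses k-means++)
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        for idx, (x, y) in enumerate(points):
            dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, labels


# Two well-separated toy groups (hypothetical data, not TF-IDF vectors)
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The result depends on the starting centroids and on k, which is one reason the clusters found on real articles should be read as a rough grouping.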

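NER and K-Means are my favorite analysis methods. Others may prefer the N-gram and unique-words methods. In this article, I covered both the best-known and the lesser-known methods of text visualization and analysis. Each method in this article is distinct and can help you with visualization and analysis.

I hope this article helps you discover the unknowns in your text data.

Original article: https://towardsdatascience.co…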

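Retrieving data from the data source

Cleaning the data

1. Reuters article length

2. Common words in Reuters articles

2.1 Most common unigrams (N=1)

2.2 Most common bigrams (N=2)

2.3 Most common trigrams (N=3)

3. Named entity recognition (NER) tagging of text data

4. Unique words in text data

5. Clustering articles with K-Means

Conclusion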