Artificial Intelligence: Creating Summaries with NLP

Author | Louis Teo
Compiled by | VK
Source | Towards Data Science

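Have you ever faced a pile of reports when all you wanted was a quick summary of each one?

Summarization has become a very helpful way of tackling data problems in the 21st century. In this article, I will show you how to build a personal text summarizer using natural language processing (NLP) in Python.

Foreword: a personal text summarizer is not hard to build, and a beginner can do it easily!

Basically, text summarization is the task of producing an accurate summary that keeps the key information without losing the overall meaning of the text.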

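There are two general types of summarization:

Abstractive summarization >> generates new sentences from the original text.
Extractive summarization >> identifies the important sentences and uses them to build the summary.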

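I use extractive summarization because I can apply it to many documents without having to carry out a lot of (daunting) machine learning model training.

In addition, extractive summarization tends to give better results than abstractive summarization, because an abstractive summarizer has to generate new sentences from the original text, which is a harder problem than using a data-driven method to pick out the important sentences.

We will use a word histogram to rank the sentences by importance and then build the summary from the top-ranked ones. The benefit is that you do not need to train a model before using it on a document.

Here is the workflow we will follow…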

Import text >> Clean text and split into sentences >> Remove stop words >> Build word histogram >> Rank sentences >> Select top N sentences for the extractive summary
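
Before going through the steps one by one, the whole workflow can be sketched in a few lines using only the standard library. This is a minimal illustration, not the NLTK-based code developed in this article: the regex sentence splitter, the tiny STOP_WORDS set, and the summarize function name are all assumptions made for the sketch.

```python
import heapq
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it"}


def summarize(text, top_n=2, max_words=30):
    """Return the top_n highest-scoring sentences, kept in their original order."""
    # Naive sentence splitter (assumption): break after '.', '!' or '?' plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Word histogram over the lowercased text, skipping stop words.
    words = re.findall(r"[a-z]+", text.lower())
    histogram = Counter(w for w in words if w not in STOP_WORDS)

    # Score each sufficiently short sentence by summing the counts of its words.
    scores = {}
    for sentence in sentences:
        if len(sentence.split()) < max_words:
            tokens = re.findall(r"[a-z]+", sentence.lower())
            scores[sentence] = sum(histogram[t] for t in tokens)

    # Pick the best sentences, then emit them in document order.
    best = set(heapq.nlargest(top_n, scores, key=scores.get))
    return [s for s in sentences if s in best]


sample = (
    "Apple bought an AI startup. "
    "The weather was nice. "
    "The AI startup will help Apple improve its apps."
)
print(summarize(sample, top_n=2))
```

On this sample the off-topic weather sentence scores lowest and is dropped, while the two sentences that share the frequent words are kept.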

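I used the text of a news article titled "Apple Acquires AI Startup For $50 Million To Advance Its Apps". You can find the original news article here: https://analyticsindiamag.com…

You can also download the text document from my Github: https://github.com/louisteo9/…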

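# Natural Language Toolkit (NLTK)
import nltk
nltk.download('stopwords')
# Tokenizer models used by sent_tokenize and word_tokenize below
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Regular expressions for text preprocessing
import re

# Heap queue algorithm for picking the top sentences
import heapq

# NumPy for numerical computing
import numpy as np

# pandas for creating DataFrames
import pandas as pd

# matplotlib for plotting
from matplotlib import pyplot as plt
%matplotlib inline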

There are many ways to do this. The goal here is to end up with clean text that we can feed into our model.

# Load the text file
with open('Apple_Acquires_AI_Startup.txt', 'r') as f:
    file_data = f.read()

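Here, we use regular expressions to preprocess the text. We will

(A) replace reference numbers such as [1], [10], [20] with a space (if there are any…),

(B) replace one or more whitespace characters with a single space.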

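text = file_data
# Replace reference numbers such as [1], [10], [20] with a space, if any
text = re.sub(r'\[[0-9]*\]',' ',text)

# Replace one or more whitespace characters with a single space
text = re.sub(r'\s+',' ',text)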
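
To make the effect of these two substitutions concrete, here is a small self-contained sketch on a made-up string. The raw string is invented for illustration and is not taken from the news article.

```python
import re

# Made-up input with reference markers, a blank line, and repeated spaces.
raw = "Apple acquired Vilynx [1].\n\nThe deal   was reported by Bloomberg [12]."

# (A) Replace bracketed reference numbers with a space.
no_refs = re.sub(r"\[[0-9]*\]", " ", raw)

# (B) Collapse any run of whitespace (spaces, tabs, newlines) into one space.
tidy = re.sub(r"\s+", " ", no_refs)

print(tidy)
```

Two details are worth noticing: because each marker is replaced by a space rather than removed, a space is left in front of the following period, and because the pattern uses `*` rather than `+`, an empty pair of brackets `[]` would be replaced as well.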

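Then we form a clean text in lowercase (without special characters, digits, or extra spaces) and split it into individual words, which are used to compute the word scores and build the word histogram.

The reason for forming a clean text is that the algorithm will then not treat, for example, "Understanding" and "understanding" as two different words.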

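# Convert all uppercase characters to lowercase
clean_text = text.lower()

# Replace non-word characters (anything other than letters, digits, and underscore) with a space
clean_text = re.sub(r'\W',' ',clean_text)

# Replace digits with a space
clean_text = re.sub(r'\d',' ',clean_text)

# Replace one or more whitespace characters with a single space
clean_text = re.sub(r'\s+',' ',clean_text)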
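
The cleaning steps above can be wrapped in one function to see what they do to a single sentence. This is a sketch under assumptions: the make_clean_text name is hypothetical, and the final strip() is an addition that the code above does not perform.

```python
import re


def make_clean_text(text):
    """Lowercase the text and keep only letters, separated by single spaces."""
    lowered = text.lower()
    # \W matches anything that is not a letter, digit, or underscore.
    no_punct = re.sub(r"\W", " ", lowered)
    # Digits survive \W, so they are removed in a separate pass.
    no_digits = re.sub(r"\d", " ", no_punct)
    # Collapse whitespace; strip() is an extra step added for this sketch.
    return re.sub(r"\s+", " ", no_digits).strip()


print(make_clean_text("Understanding AI: Apple paid $50 million!"))
```

After this normalization, "Understanding" and "understanding" map to the same token, which is exactly what the word histogram needs.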

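We use the NLTK sent_tokenize method to split the text into sentences. We will evaluate the importance of each sentence and then decide whether it should be included in the summary.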

sentences = nltk.sent_tokenize(text)

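Stop words are English words that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. We have already downloaded a file containing the English stop words.

Here, we get the list of stop words and store it in the stop_words variable.

# Get the list of stop words
stop_words = nltk.corpus.stopwords.words('english')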

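Let's evaluate the importance of each word by the number of times it appears in the whole text.

We will do this by (1) splitting the clean text into words, (2) removing the stop words, and then (3) counting the frequency of each remaining word in the text.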

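# Create an empty dictionary to hold the word counts
word_count = {}

# Loop through the tokenized words, skip stop words, and save the word counts to the dictionary
for word in nltk.word_tokenize(clean_text):
    # Skip stop words
    if word not in stop_words:
        # Save the word count to the dictionary
        if word not in word_count.keys():
            word_count[word] = 1
        else:
            word_count[word] += 1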
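
The same counting can be written more compactly with collections.Counter from the standard library. The sketch below is an alternative, not the code used in this article; the demo text, the demo stop-word set, and the plain split() tokenization are assumptions chosen to keep it self-contained.

```python
from collections import Counter

# Hypothetical stand-ins for stop_words and clean_text.
stop_words_demo = {"the", "is", "a", "of"}
clean_text_demo = "the startup is a leader of ai and the ai startup grows"

# Count every token that is not a stop word.
word_count_demo = Counter(
    word for word in clean_text_demo.split() if word not in stop_words_demo
)

# most_common(n) returns the n most frequent words with their counts.
print(word_count_demo.most_common(2))
```

A Counter behaves like a dictionary, so it can be passed to the plotting and scoring code in the same way as word_count.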

Let's plot the word histogram and look at the result.

plt.figure(figsize=(16,10))
plt.xticks(rotation = 90)
plt.bar(word_count.keys(), word_count.values())
plt.show()

Let's convert it into a horizontal bar chart that shows only the top 20 words, using the helper function below.

# Helper function for plotting the top words
def plot_top_words(word_count_dict, show_top_n=20):
    word_count_table = pd.DataFrame.from_dict(word_count_dict, orient = 'index').rename(columns={0: 'score'})
    
    word_count_table.sort_values(by='score').tail(show_top_n).plot(kind='barh', figsize=(10,10))
    plt.show()

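Let's show the top 20 words.

plot_top_words(word_count, 20)

From the plot above, we can see that the words "ai" and "apple" appear at the top. This makes sense, because the article is about Apple acquiring an AI startup.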

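Now we will rank each sentence by importance according to its sentence score. We will:

drop sentences that have 30 or more words, recognizing that long sentences are not always meaningful;
then add up the scores of the words that make up each sentence to form the sentence score.

Sentences with high scores are ranked first, and the top-ranked sentences form our summary.

Note: in my experience, a limit anywhere between 25 and 30 words gives a good summary.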

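# Create an empty dictionary to store the sentence scores
sentence_score = {}

# Loop through the tokenized sentences, take only sentences with fewer than 30 words, and add up the word scores to form the sentence score
for sentence in sentences:
    # Check whether each word in the sentence is in the word_count dictionary
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_count.keys():
            # Only accept sentences with fewer than 30 words
            if len(sentence.split(' ')) < 30:
                # Add the word score to the sentence score
                if sentence not in sentence_score.keys():
                    sentence_score[sentence] = word_count[word]
                else:
                    sentence_score[sentence] += word_count[word]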
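
The scoring loop can also be written as a function in which the length check runs once per sentence instead of once per word. This is a sketch that aims for the same result on simple input: the score_sentences name is hypothetical, and a regex tokenizer stands in for nltk.word_tokenize so that the example runs without NLTK.

```python
import re


def score_sentences(sentences, word_count, max_words=30):
    """Map each short-enough sentence to the sum of its word counts."""
    scores = {}
    for sentence in sentences:
        # Skip long sentences up front (same test as the loop above).
        if len(sentence.split(" ")) >= max_words:
            continue
        # Assumption: a simple regex tokenizer instead of nltk.word_tokenize.
        tokens = re.findall(r"\w+", sentence.lower())
        matched = [word_count[t] for t in tokens if t in word_count]
        # Like the loop above, only sentences with at least one counted word get a score.
        if matched:
            scores[sentence] = sum(matched)
    return scores


demo_counts = {"apple": 3, "ai": 2, "startup": 1}
demo_sentences = ["Apple buys an AI startup.", "It rained."]
print(score_sentences(demo_sentences, demo_counts))
```

Note that the score is a plain sum, so among sentences under the length limit, longer ones have more chances to accumulate points.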

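We convert the sentence-score dictionary into a DataFrame and display sentence_score.

Note: a dictionary has no built-in method for sorting its entries by value, so we convert the data stored in the dictionary into a DataFrame, which makes it easy to sort and display the sentences by score.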

df_sentence_score = pd.DataFrame.from_dict(sentence_score, orient = 'index').rename(columns={0: 'score'})
df_sentence_score.sort_values(by='score', ascending = False)
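
If a DataFrame is not needed, the same ranking can be obtained with the built-in sorted function applied to the dictionary items. The scores below are made-up values for illustration only.

```python
# Made-up sentence scores.
demo_scores = {"Sentence A.": 4, "Sentence B.": 9, "Sentence C.": 6}

# Sort the (sentence, score) pairs by score, highest first.
ranked = sorted(demo_scores.items(), key=lambda item: item[1], reverse=True)

for sentence, score in ranked:
    print(score, sentence)
```

The DataFrame route is still convenient in a notebook because it renders as a table.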

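We use the heap queue algorithm to select the top 3 sentences and store them in the best_sentences variable.

Usually 3 to 5 sentences are enough. Depending on the length of the document, feel free to change the number of top sentences to display.

In this example I chose 3, because our text is relatively short.

# Pick the three best sentences as the summary
best_sentences = heapq.nlargest(3, sentence_score, key=sentence_score.get)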

Let's use print and a for loop to display the summary text.

print('SUMMARY')
print('------------------------')

# Display the top sentences in the order in which they appear in the original text
for sentence in sentences:
    if sentence in best_sentences:
        print(sentence)
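
The selection and ordering steps are easy to mix up, so here is a small sketch with made-up data. heapq.nlargest returns the keys ordered by score, while looping over the original sentence list restores the reading order.

```python
import heapq

# Made-up sentences and scores.
demo_sentences = ["First.", "Second.", "Third.", "Fourth."]
demo_score = {"First.": 5, "Second.": 1, "Third.": 8, "Fourth.": 3}

# Keys of the two highest-scoring sentences, best first.
best = heapq.nlargest(2, demo_score, key=demo_score.get)

# Re-emit the chosen sentences in their original document order.
summary = [s for s in demo_sentences if s in best]

print(best)
print(summary)
```

Printing in document order usually makes the summary read more naturally than printing in score order.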

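Here is the link to my Github for the Jupyter notebook. You will also find an executable Python file that you can use right away to summarize your own text: https://github.com/louisteo9/…

Below is the original text of the news article titled "Apple Acquires AI Startup For $50 Million To Advance Its Apps" (the original can be found here): https://analyticsindiamag.com…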

In an attempt to scale up its AI portfolio, Apple has acquired Spain-based AI video startup — Vilynx for approximately $50 million.

Reported by Bloomberg, the AI startup — Vilynx is headquartered in Barcelona, which is known to build software using computer vision to analyse a video’s visual, text, and audio content with the goal of “understanding” what’s in the video. This helps it categorising and tagging metadata to the videos, as well as generate automated video previews, and recommend related content to users, according to the company website.

Apple told the media that the company typically acquires smaller technology companies from time to time, and with the recent buy, the company could potentially use Vilynx’s technology to help improve a variety of apps. According to the media, Siri, search, Photos, and other apps that rely on Apple are possible candidates as are Apple TV, Music, News, to name a few that are going to be revolutionised with Vilynx’s technology.

With CEO Tim Cook’s vision of the potential of augmented reality, the company could also make use of AI-based tools like Vilynx.

The purchase will also advance Apple’s AI expertise, adding up to 50 engineers and data scientists joining from Vilynx, and the startup is going to become one of Apple’s key AI research hubs in Europe, according to the news.

Apple has made significant progress in the space of artificial intelligence over the past few months, with this purchase of UK-based Spectral Edge last December, Seattle-based Xnor.ai for $200 million and Voysis and Inductiv to help it improve Siri. With its habit of quietly purchasing smaller companies, Apple is making a mark in the AI space. In 2018, CEO Tim Cook said in an interview that the company had bought 20 companies over six months, while only six were public knowledge.

The summary is as follows:

SUMMARY
------------------------
In an attempt to scale up its AI portfolio, Apple has acquired Spain-based AI video startup — Vilynx for approximately $50 million.
With CEO Tim Cook’s vision of the potential of augmented reality, the company could also make use of AI-based tools like Vilynx.
With its habit of quietly purchasing smaller companies, Apple is making a mark in the AI space.

Congratulations! You have created your own personal text summarizer in Python. I hope the summary looks pretty decent.

Original article: https://towardsdatascience.co…



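What is text summarization

Which summarization method should you use

How to create your own text summarizer

Text summarization workflow

(1) Sample text

(2) Import libraries

(3) Import text and perform preprocessing

(4) Split text into sentences

(5) Remove stop words

(6) Build the histogram

(7) Rank sentences by score

(8) Select the top sentences as the summary

Let's see the algorithm in action!

Conclusion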