Generating Text Summaries with BERT

Author | Daulet Nurmanbetov
Translator | VK
Source | Towards Data Science

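Have you ever had to boil a lengthy document down to a summary, or provide a summary for a document? As you know, the process is tedious and slow for us humans: we have to read the entire document, focus on the important sentences, and finally rewrite those sentences into a coherent summary.

This is where automatic summarization can help. Machine learning has made great strides in summarization, but there is still plenty of room for improvement. Machine summarization generally falls into two types: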

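Extractive summarization: if an important sentence appears in the original document, extract it.

Abstractive summarization: summarize the important ideas or facts contained in the document without repeating its wording. This is what we usually have in mind when asked to summarize a document.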

I would like to show you some recent results for abstractive summarization with BERT_Sum_Abs, from Yang Liu and Mirella Lapata's work Text Summarization with Pretrained Encoders: https://arxiv.org/pdf/1908.08…

Performance of BERT abstractive summarization

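Summarization aims to condense a document into a shorter version while preserving most of its meaning. Abstractive summarization requires language generation ability, because the summary may contain new words and phrases that do not appear in the source document. Extractive summarization is usually framed as a binary classification task whose labels indicate whether a text span (typically a sentence) should be included in the summary.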
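
To make the binary-classification framing of extractive summarization concrete, here is a minimal Python sketch. The sentences, the scores, and the 0.5 threshold are made-up illustrative values, not the output of BERT_Sum_Abs or any real model; a trained classifier would supply the scores.

```python
# Hypothetical illustration: extractive summarization as per-sentence
# binary classification. The scores are hard-coded stand-ins for a model.

def extract_summary(sentences, scores, threshold=0.5):
    """Label each sentence 1 (keep) or 0 (drop) and join the kept ones."""
    labels = [1 if score >= threshold else 0 for score in scores]
    summary = " ".join(sent for sent, keep in zip(sentences, labels) if keep)
    return labels, summary

sentences = [
    "The Fed cut rates by half a point.",
    "The weather in New York was mild.",
    "Other central banks followed on Wednesday.",
]
scores = [0.91, 0.08, 0.77]  # made-up probabilities of "belongs in summary"

labels, summary = extract_summary(sentences, scores)
print(labels)
print(summary)
```

An abstractive model, by contrast, generates new text token by token and is not limited to copying whole sentences.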

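Below is how BERT_Sum_Abs performs on the standard summarization datasets, CNN and Daily Mail, which are commonly used for benchmarking. The evaluation metric is the ROUGE F1 score.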
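
As a rough illustration of what a ROUGE F1 score measures, the sketch below computes a simplified ROUGE-1 F1 (unigram overlap) between a candidate summary and a reference. It is a teaching sketch under simplifying assumptions (whitespace tokenization, no stemming) and is not the scorer used to produce the benchmark numbers; the function name `rouge1_f1` and the example strings are made up.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

candidate = "the fed cut rates"
reference = "the fed cut interest rates sharply"
score = rouge1_f1(candidate, reference)
print(round(score, 3))  # precision 4/4, recall 4/6 -> F1 = 0.8
```

Full ROUGE evaluations typically report several variants, such as ROUGE-1, ROUGE-2 and ROUGE-L.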

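The results show that the BERT_Sum_Abs model outperforms most non-Transformer-based models. Better still, the code behind the model is open source, and an implementation is available on GitHub (https://github.com/huggingfac…).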

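Demo and code

Let's walk through an example of summarizing an article. We will use the following article, in which a Fed official says central bankers are aligned in their response to the coronavirus. Here is the full text: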

The Federal Reserve Bank of New York president, John C. Williams, made clear on Thursday evening that officials viewed the emergency rate cut they approved earlier this week as part of an international push to cushion the economy as the coronavirus threatens global growth.
Mr. Williams, one of the Fed’s three key leaders, spoke in New York two days after the Fed slashed borrowing costs by half a point in its first emergency move since the depths of the 2008 financial crisis. The move came shortly after a call between finance ministers and central bankers from the Group of 7, which also includes Britain, Canada, France, Germany, Italy and Japan. “Tuesday’s phone call between G7 finance ministers and central bank governors, the subsequent statement, and policy actions by central banks are clear indications of the close alignment at the international level,” Mr. Williams said in a speech to the Foreign Policy Association.
Rate cuts followed in Canada, Asia and the Middle East on Wednesday. The Bank of Japan and European Central Bank — which already have interest rates set below zero — have yet to further cut borrowing costs, but they have pledged to support their economies.
Mr. Williams’s statement is significant, in part because global policymakers were criticized for failing to satisfy market expectations for a coordinated rate cut among major economies. Stock prices temporarily rallied after the Fed’s announcement, but quickly sank again.
Central banks face challenges in offsetting the economic shock of the coronavirus.
Many were already working hard to stoke stronger economic growth, so they have limited room for further action. That makes the kind of carefully orchestrated, lock step rate cut central banks undertook in October 2008 all but impossible.
Interest rate cuts can also do little to soften the near-term hit from the virus, which is forcing the closure of offices and worker quarantines and delaying shipments of goods as infections spread across the globe. “It’s up to individual countries, individual fiscal policies and individual central banks to do what they were going to do,” Fed Chair Jerome H. Powell said after the cut, noting that different nations had “different situations.” Mr. Williams reiterated Mr. Powell’s pledge that the Fed would continue monitoring risks in the “weeks and months” ahead. Economists widely expect another quarter-point rate cut at the Fed’s March 18 meeting.
The New York Fed president, whose reserve bank is partly responsible for ensuring financial markets are functioning properly, also promised that the Fed stood ready to act as needed to make sure that everything is working smoothly.
Since September, when an obscure but crucial corner of money markets experienced unusual volatility, the Fed has been temporarily intervening in the market to keep it calm. The goal is to keep cash flowing in the market for overnight and short-term loans between banks and other financial institutions. The central bank has also been buying short-term government debt. “We remain flexible and ready to make adjustments to our operations as needed to ensure that monetary policy is effectively implemented and transmitted to financial markets and the broader economy,” Mr. Williams said Thursday.

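First, we need to get the model code, install the dependencies, and download the datasets, as shown below. You can easily run these steps on your own Linux machine: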

# Install Huggingface's Transformers
git clone https://github.com/huggingface/transformers && cd transformers
pip install .
pip install nltk py-rouge
cd examples/summarization

#------------------------------
# Download the original summarization datasets. The code downloads from Google Drive on Linux
wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/Code: \1\n/p'
wget --load-cookies cookies.txt --no-check-certificate 'https://drive.google.com/uc?export=download&confirm=<CONFIRMATION CODE HERE>&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ' -O cnn_stories.tgz

wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfM1BxdkxVaTY2bWs' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/Code: \1\n/p'
wget --load-cookies cookies.txt --no-check-certificate 'https://drive.google.com/uc?export=download&confirm=<CONFIRMATION CODE HERE>&id=0BwmD_VLjROrfM1BxdkxVaTY2bWs' -O dailymail_stories.tgz

# Unpack both archives
tar -xvf cnn_stories.tgz && tar -xvf dailymail_stories.tgz
rm cnn_stories.tgz dailymail_stories.tgz

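# Move the articles into a single location
# (assumes the archives unpack to cnn/stories and dailymail/stories in the current directory)
mkdir -p bertabs/dataset
mkdir -p bertabs/summaries_out
cp -r cnn/stories/. bertabs/dataset/
cp -r dailymail/stories/. bertabs/dataset/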

# Select a subset of articles to summarize
mkdir bertabs/dataset2
cd bertabs/dataset && find . -maxdepth 1 -type f | head -1000 | xargs cp -t ../dataset2/
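
The final step above, copying a subset of the articles into a separate folder, can also be sketched in Python for readers who prefer not to rely on `find` and `xargs`. This is a minimal sketch, not part of the original instructions: the helper name `copy_subset` is hypothetical, and the directory names in the usage comment simply mirror the shell commands.

```python
import shutil
from pathlib import Path

def copy_subset(src_dir, dst_dir, limit=1000):
    """Copy up to `limit` regular files from src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    # Sorting makes the selection deterministic across runs.
    for path in sorted(src.iterdir()):
        if copied >= limit:
            break
        if path.is_file():
            shutil.copy(path, dst / path.name)
            copied += 1
    return copied

# Example usage (paths are assumptions that mirror the shell commands):
# copy_subset("bertabs/dataset", "bertabs/dataset2", limit=1000)
```

Unlike `find | head`, which returns files in whatever order the filesystem lists them, the sorted iteration here picks the same files every time.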

After running the code above, we now run the python command shown below to summarize the documents in the /dataset2 directory:

python run_summarization.py \
    --documents_dir bertabs/dataset2 \
    --summaries_output_dir bertabs/summaries_out \
    --batch_size 64 \
    --min_length 50 \
    --max_length 200 \
    --beam_size 5 \
    --alpha 0.95 \
    --block_trigram true \
    --compute_rouge true

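The parameters are as follows:

documents_dir, the folder where the documents are located

summaries_output_dir, the folder where the summaries are written. Defaults to the folder where the documents are located

batch_size, the batch size per GPU/CPU for training

beam_size, the number of beams to start with for each example

block_trigram, whether to block repeated trigrams in the text generated by beam search

compute_rouge, whether to compute ROUGE metrics during evaluation. Only available for the CNN/DailyMail dataset

alpha, the value of alpha for the length penalty in beam search (the larger the value, the larger the penalty)

min_length, the minimum number of tokens in a summary

max_length, the maximum number of tokens in a summary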
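
Two of these options, block_trigram and alpha, are easier to picture with a small example. The sketch below is illustrative only and is not taken from run_summarization.py: the helper names are hypothetical, and the length-penalty formula is the GNMT-style form ((5 + length) / 6) ** alpha, which is an assumption about what the script does rather than something the article states.

```python
def has_repeated_trigram(tokens):
    """Return True if any three-token sequence occurs more than once."""
    seen = set()
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        if trigram in seen:
            return True
        seen.add(trigram)
    return False

def length_normalized_score(log_prob, length, alpha=0.95):
    """Divide a hypothesis log-probability by an assumed length penalty.

    Assumed form: ((5 + length) / 6) ** alpha. The script's exact
    formula may differ.
    """
    penalty = ((5.0 + length) / 6.0) ** alpha
    return log_prob / penalty

repeated = "the fed cut rates and the fed cut rates again".split()
clean = "the fed cut rates on tuesday".split()
print(has_repeated_trigram(repeated), has_repeated_trigram(clean))
print(length_normalized_score(-6.0, length=19, alpha=1.0))
```

With trigram blocking enabled, a beam search would discard or heavily penalize a candidate like the first one, because "the fed cut" appears twice.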

Once BERT_Sum_Abs finishes, we get the following summary:

The Fed slashed borrowing costs by half a point in its first emergency move since the depths of the 2008 financial crisis. Rate cuts followed in Canada, Asia and the Middle East on Wednesday. The Bank of Japan and European Central Bank have yet to further cut borrowing costs, but they have pledged to support their economies.

Here is another article: https://news.stonybrook.edu/n…

The resulting summary is as follows:

The research team focused on the Presymptomatic period during which prevention may be most effective. They showed that communication between brain regions destabilizes with age, typically in the late 40's, and that destabilization associated with poorer cognition. The good news is that we may be able to prevent or reverse these effects with diet, mitigating the impact of encroaching Hypometabolism by exchanging glucose for ketones as fuel for neurons.

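Conclusion

As you can see, BERT is improving every aspect of NLP. This means that NLP performance is moving closer to human level day by day, all while remaining open source.

Commoditized NLP is on the horizon: each new NLP model not only sets new records on benchmarks but is also available for anyone to use. Just as OCR technology was commoditized 10 years ago, NLP will be too over the next few years.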

Original article: https://towardsdatascience.co…
