Related articles:

Link to this project:
https://aistudio.baidu.com/aistudio/projectdetail/4156802?contributionType=1

Extracting Key Information from Waybills

Dataset link: https://download.csdn.net/dow...
This project mainly covers:

  • The PaddleNLP network layers BiGRU, CRF, and ViterbiDecoder.
  • Warm-start loading of pretrained Chinese word vectors via paddlenlp.embeddings to improve accuracy.
  • The evaluation metric paddlenlp.metrics.ChunkEvaluator.

PART A. Background

A.1 The Waybill Information Extraction Task

How do we extract the key information we want from logistics text? The first step is to define which fields need to be extracted.

For example, given a waybill such as "张三18625584663广东省深圳市南山区学府路东百度国际大厦" as the model input, the goal of the sequence labeling model is to recognize "张三" as a person name (denoted P), "18625584663" as a phone number (denoted T), and "广东省 / 深圳市 / 南山区 / 百度国际大厦" as the level 1–4 address components (denoted A1~A4, roughly province, city, district, and street-level address).

This is a typical named entity recognition (NER) scenario. The entity types and their corresponding tags are listed in the table below:

| Entity/Field | Tag | Extracted result |
| --- | --- | --- |
| Name | P | 张三 |
| Phone | T | 18625584663 |
| Province | A1 | 广东省 |
| City | A2 | 深圳市 |
| District | A3 | 南山区 |
| Detailed address | A4 | 百度国际大厦 |

A.2 Sequence Labeling Models

We can solve the waybill information extraction task with a sequence labeling model. Let's look at this kind of model in more detail.

In a sequence labeling task we usually define a label set that covers every possible prediction. For the entities to be extracted here (name, phone, province, city, district, detailed address), the label set can be defined as:

label = {P-B, P-I, T-B, T-I, A1-B, A1-I, A2-B, A2-I, A3-B, A3-I, A4-B, A4-I, O}

Each label is defined as follows:

| Label | Definition |
| --- | --- |
| P-B | Beginning of a name |
| P-I | Middle or end of a name |
| T-B | Beginning of a phone number |
| T-I | Middle or end of a phone number |
| A1-B | Beginning of a province |
| A1-I | Middle or end of a province |
| A2-B | Beginning of a city |
| A2-I | Middle or end of a city |
| A3-B | Beginning of a district/county |
| A3-I | Middle or end of a district/county |
| A4-B | Beginning of a detailed address |
| A4-I | Middle or end of a detailed address |
| O | Irrelevant character |

Note that each label's suffix can only be B, I, or O. This labeling convention is called the BIO scheme; there is also a slightly more elaborate BIESO scheme, which we will not cover here. B marks the beginning of an entity of a given type (e.g., P-B marks the first character of a name), and I marks its continuation.
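
As a minimal illustration (plain Python; the helper `merge_bio` is our own name, not a PaddleNLP API), this is how BIO tags are merged back into typed entities:

```python
def merge_bio(chars, tags):
    """Merge per-character BIO tags into (text, type) entity tuples."""
    entities = []
    text, etype = "", None
    for ch, tag in zip(chars, tags):
        if tag.endswith("-B") or tag == "O":
            if text:                        # close the previous chunk
                entities.append((text, etype))
            text, etype = ch, tag.split("-")[0]
        else:                               # an "-I" tag continues the chunk
            text += ch
    if text:
        entities.append((text, etype))
    return [e for e in entities if e[1] != "O"]

chars = list("张三") + list("18625584663")
tags = ["P-B", "P-I", "T-B"] + ["T-I"] * 10
print(merge_bio(chars, tags))  # [('张三', 'P'), ('18625584663', 'T')]
```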

For the sentence "张三18625584663广东省深圳市南山区百度国际大厦", each character and its label are:

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/1f716a6ad48649cc99c56c27108773bea6b0afa3f36e4efba4851641658b2414" width="500" height="313" ></center>

<center>Figure 1: Annotation example from the dataset</center>

Note that "张" and "三" are labeled "P-B" and "P-I" here, and merging "P-B" with "P-I" yields the single label "P". Recombining the labels this way gives the following extraction result:

| 张三 | 18625584663 | 广东省 | 深圳市 | 南山区 | 百度国际大厦 |
| --- | --- | --- | --- | --- | --- |
| P | T | A1 | A2 | A3 | A4 |

PART B. Related Algorithms

B.1 Gated Recurrent Unit (GRU)

BiGRU belongs to the family of classic recurrent neural networks (RNN, Recurrent Neural Network). Elsewhere in this project the model is mostly used as a black box; here we explain the concept and the underlying principles. A schematic of an RNN is shown below.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/a11f62ff794b4e00985f0fc39f2879bd75377481eedd47ebb489c131bc7bc96c" width="500" height="313" ></center>

<center>Figure 4: RNN schematic</center>

On the left is the original RNN: the green node denotes the input x, the red node the output y, and the blue block in between is the RNN module. The orange arrow loops from the module back to itself, indicating that the RNN's input includes its own output from the previous time step; this is where the word "recurrent" in the name comes from.

On the right, the same network is unrolled along the time axis. Note that the blue RNN module is one and the same module, merely reused at each time step. This unrolled view shows the inputs and outputs of a sequence labeling model clearly.

GRU was proposed to address long-term memory and the gradient problems of backpropagation. Like LSTM, it can model long sequences effectively, and it trains more efficiently.
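
As a quick, minimal sketch (shapes are illustrative; this mirrors the `nn.GRU` layer used in the network later in this project), a bidirectional GRU reads the sequence in both directions and concatenates the two hidden states:

```python
import paddle
import paddle.nn as nn

gru = nn.GRU(input_size=300, hidden_size=300,
             num_layers=2, direction='bidirectional')
x = paddle.randn([32, 50, 300])   # [batch, seq_len, emb_size]
out, _ = gru(x)                   # forward and backward states concatenated
print(out.shape)                  # [32, 50, 600] == hidden_size * 2
```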

B.2 Conditional Random Field (CRF)

With long sentences handled, another problem in sequence labeling still needs solving: the dependencies between labels. For example, the predicted labels should never contain P-B immediately followed by T-I, because such a pair is invalid and cannot be parsed. Neither RNN nor LSTM can rule this out in principle; they can only make it unlikely. The conditional random field (CRF) described below solves this problem elegantly.

A CRF is an undirected model from the family of probabilistic graphical models. We will not develop the theory here, only give an intuition for the idea behind the model. A classic linear-chain CRF is shown below.

<center><img width="500px" src="https://ai-studio-static-online.cdn.bcebos.com/961cdb6116de473481db455d375580d3a10d028e50534014973fca6ce6415131" /></center>
<center>Figure 5: CRF schematic</center>

A CRF is essentially an undirected graph in which the green nodes are inputs and the red nodes outputs. Its edges fall into two classes: edges between $x$ and $y$, expressing their correlation, and edges between adjacent $y$ nodes, expressing the correlation between neighboring labels. In other words, when predicting the label $y$ at a given position, the model also considers the neighboring labels. Once the CRF converges, it has learned, for instance, that P-B followed by T-I as adjacent labels has extremely low probability.
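
The toy sketch below (plain Python with made-up scores; a real CRF learns these scores during training) shows the idea: a tag sequence is scored as the sum of per-position emission scores and between-position transition scores, so a strongly negative learned transition such as P-B → T-I drags the whole sequence's score down:

```python
tags = ["P-B", "T-I"]                              # an invalid adjacent pair
emission = [{"P-B": 2.0, "T-I": 0.1},              # per-position tag scores
            {"P-B": 0.3, "T-I": 1.5}]
transition = {("P-B", "P-I"): 3.0, ("P-B", "T-I"): -5.0}

score = emission[0][tags[0]]
for i in range(1, len(tags)):
    score += transition.get((tags[i - 1], tags[i]), 0.0) + emission[i][tags[i]]
print(score)  # 2.0 - 5.0 + 1.5 = -1.5: the bad transition dominates
```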

PART C. Code in Practice

```python
!pip install --upgrade paddlenlp

import paddle
import paddle.nn as nn
import paddlenlp
from paddlenlp.datasets import MapDataset
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.layers import LinearChainCrf, ViterbiDecoder, LinearChainCrfLoss
from paddlenlp.metrics import ChunkEvaluator
```

C.1 Data Preparation

To train a sequence labeling model we usually prepare three datasets: a training set train.txt, a development set dev.txt, and a test set test.txt, all stored in the data directory.

  • Training set: used to fit the model parameters; the model adjusts its parameters on this data to improve its predictions.
  • Development set: used to monitor the model's state and convergence during training; it is typically used for hyperparameter tuning, picking the configuration that performs best on it.
  • Test set: used to compute the final evaluation metrics and verify the model's generalization ability.

In addition, the sequence labeling model relies on the following dictionary files, described as living in a conf directory (note that the dataset downloaded below actually ships them under data/, which is the path the code uses):

  • word.dic: the vocabulary of the input text
  • q2b.dic: a dictionary for normalizing special (full-width) characters in the input text
  • tag.dic: the label dictionary

Here we provide a pre-annotated waybill key-information dataset; you can also organize training data of your own. Apart from the fixed text_a\tlabel header on the first line, every line consists of two tab-separated columns: the first is the UTF-8 encoded Chinese text with characters separated by \002, the second is the corresponding per-character labels, also separated by \002.
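
To make the format concrete, here is a tiny decoding sketch (the two-character line is made up for illustration, not taken from train.txt):

```python
line = "张\002三\tP-B\002P-I"            # text column \t label column
text_col, label_col = line.split("\t")
print(text_col.split("\002"))            # ['张', '三']
print(label_col.split("\002"))           # ['P-B', 'P-I']
```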

The directory layout of the dataset and dictionary files is shown below.
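
Based on the file paths used in the code below (the original listing is missing here, so this is an inference, not a verbatim listing), the layout after downloading is roughly:

```
data/
├── train.txt
├── dev.txt
├── test.txt
├── word.dic
├── tag.dic
└── q2b.dic
```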

In both the training and prediction stages we need to preprocess the raw data. Concretely, we:

  1. Extract the sentences and labels from the raw files, building sentence sequences and label sequences
  2. Normalize special characters in the sentence sequences
  3. Look up each token's id index in the dictionaries

Let's take a look at the training set. As described above, each line after the text_a\tlabel header holds the \002-separated characters and their \002-separated labels, joined by a tab.

1.1 Download and extract the dataset

```python
from paddle.utils.download import get_path_from_url

# Download and extract the waybill dataset into ./data
URL = "https://paddlenlp.bj.bcebos.com/paddlenlp/datasets/waybill.tar.gz"
get_path_from_url(URL, "./")

# Print a few raw samples: text column, then label column
for i, line in enumerate(open('data/train.txt')):
    if 0 < i < 5:
        print('%d: ' % i, line.split()[0])
        print('   ', line.split()[1])
```

```
1:  16620200077宣荣嗣甘肃省白银市会宁县河畔镇十字街金海超市西行50米
    T-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IP-BP-IP-IA1-BA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-I
2:  13552664307姜骏炜云南省德宏傣族景颇族自治州盈江县平原镇蜜回路下段
    T-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IP-BP-IP-IA1-BA1-IA1-IA2-BA2-IA2-IA2-IA2-IA2-IA2-IA2-IA2-IA2-IA3-BA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-I
3:  内蒙古自治区赤峰市阿鲁科尔沁旗汉林西街路南13701085390那峥
    A1-BA1-IA1-IA1-IA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA3-IA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IP-BP-I
4:  广东省梅州市大埔县茶阳镇胜利路13601328173张铱
    A1-BA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IP-BP-I
```

Build the custom datasets (here wrapped in paddlenlp's MapDataset)

```python
def convert_tokens_to_ids(tokens, vocab, oov_token=None):
    """Map tokens to ids, falling back to the OOV id for unknown tokens."""
    token_ids = []
    oov_id = vocab.get(oov_token) if oov_token else None
    for token in tokens:
        token_id = vocab.get(token, oov_id)
        token_ids.append(token_id)
    return token_ids

def load_dict(dict_path):
    """Load a dictionary file: one entry per line, id = line number."""
    vocab = {}
    i = 0
    for line in open(dict_path, 'r', encoding='utf-8'):
        key = line.strip('\n')
        vocab[key] = i
        i += 1
    return vocab

def load_dataset(datafiles):
    def read(data_path):
        with open(data_path, 'r', encoding='utf-8') as fp:
            next(fp)  # skip the text_a\tlabel header line
            for line in fp.readlines():
                words, labels = line.strip('\n').split('\t')
                words = words.split('\002')
                labels = labels.split('\002')
                yield words, labels

    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, list) or isinstance(datafiles, tuple):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]

train_ds, dev_ds, test_ds = load_dataset(datafiles=(
    'data/train.txt', 'data/dev.txt', 'data/test.txt'))

label_vocab = load_dict('./data/tag.dic')
word_vocab = load_dict('./data/word.dic')

def convert_example(example):
    tokens, labels = example
    token_ids = convert_tokens_to_ids(tokens, word_vocab, 'OOV')
    label_ids = convert_tokens_to_ids(labels, label_vocab, 'O')
    return token_ids, len(token_ids), label_ids

train_ds.map(convert_example)
dev_ds.map(convert_example)
test_ds.map(convert_example)
```

1.2 Build the data loaders (DataLoader)

```python
batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=word_vocab.get('OOV')),  # token_ids
        Stack(),                                     # seq_len
        Pad(axis=0, pad_val=label_vocab.get('O'))    # label_ids
    ): fn(samples)

train_loader = paddle.io.DataLoader(
    dataset=train_ds,
    batch_size=32,
    shuffle=True,
    drop_last=True,
    return_list=True,
    collate_fn=batchify_fn)
dev_loader = paddle.io.DataLoader(
    dataset=dev_ds,
    batch_size=32,
    drop_last=True,
    return_list=True,
    collate_fn=batchify_fn)
test_loader = paddle.io.DataLoader(
    dataset=test_ds,
    batch_size=32,
    drop_last=True,
    return_list=True,
    collate_fn=batchify_fn)
```

C.2 Network Construction

With the development of deep learning, mainstream sequence labeling methods now learn representations on top of word embeddings. The model's overall training pipeline is shown below.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/3ff50b7d683840a4b049e623b92f40ca37b190ca7f7b46298a58f302d2c002bc" width="700" height="513" ></center>

<center>Figure 2: Training pipeline</center>

The standard model for sequence labeling is RNN+CRF, with GRU and LSTM as the usual RNN cells. Here we take a Bi-GRU+CRF model as the example and show how to define the network with PaddlePaddle. As shown below, the GRU's output serves as the CRF's input, and the CRF's output is the model's final prediction.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/8a34120bfd304fb69dc7ac7ee795fb11bd95195403174ca59896b050a2a6b30c" width="700" height="513" ></center>

<center>Figure 3: Bi-GRU + CRF</center>

```python
class BiGRUWithCRF(nn.Layer):
    def __init__(self,
                 emb_size,
                 hidden_size,
                 word_num,
                 label_num,
                 use_w2v_emb=False):
        super(BiGRUWithCRF, self).__init__()
        if use_w2v_emb:
            # Warm-start from pretrained word vectors (see PART D)
            self.word_emb = TokenEmbedding(
                extended_vocab_path='./conf/word.dic', unknown_token='OOV')
        else:
            self.word_emb = nn.Embedding(word_num, emb_size)
        self.gru = nn.GRU(emb_size,
                          hidden_size,
                          num_layers=2,
                          direction='bidirectional')
        self.fc = nn.Linear(hidden_size * 2, label_num + 2)  # BOS EOS
        self.crf = LinearChainCrf(label_num)
        self.decoder = ViterbiDecoder(self.crf.transitions)

    def forward(self, x, lens):
        embs = self.word_emb(x)
        output, _ = self.gru(embs)
        output = self.fc(output)
        _, pred = self.decoder(output, lens)
        return output, lens, pred

# Define the model network and its loss
network = BiGRUWithCRF(300, 300, len(word_vocab), len(label_vocab))
model = paddle.Model(network)
```

C.3 Network Configuration

After defining the network structure, we configure the optimizer, the loss function, and the evaluation metric.

3.1 Evaluation metrics

For each predicted sequence, sequence labeling evaluation merges the predictions into chunks and evaluates at the chunk level. The usual metrics are Precision, Recall, and F1:

  1. Precision: the number of chunks predicted correctly divided by the total number of chunks the model predicted; it measures how accurate the model's predictions are.
  2. Recall: the number of chunks predicted correctly divided by the number of gold chunks; it measures how much the model misses.
  3. F1: the combined metric, $F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$; it considers Precision and Recall together as a trade-off between the two (see the short sketch after this list).
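
As a quick worked example of these formulas (the counts are made up; ChunkEvaluator accumulates the same three quantities internally):

```python
num_correct = 80    # chunks predicted correctly
num_infer = 100     # chunks the model predicted
num_label = 120     # gold chunks in the labels

precision = num_correct / num_infer                 # 0.80
recall = num_correct / num_label                    # ~0.667
f1 = 2 * precision * recall / (precision + recall)  # ~0.727
print(precision, recall, f1)
```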

paddlenlp.metrics ships a ChunkEvaluator implementing this metric (and its metric collection keeps growing):

```python
optimizer = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())
crf_loss = LinearChainCrfLoss(network.crf)
chunk_evaluator = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True)
model.prepare(optimizer, crf_loss, chunk_evaluator)
```

C.4 Model Training

```python
model.fit(train_data=train_loader,
          eval_data=dev_loader,
          epochs=1,
          save_dir='./results',
          log_freq=1)
```
```
step  1/50 - loss: 93.9603 - precision: 0.0000e+00 - recall: 0.0000e+00 - f1: 0.0000e+00 - 20s/step
step  2/50 - loss: 67.9459 - precision: 0.0000e+00 - recall: 0.0000e+00 - f1: 0.0000e+00 - 29s/step
step  3/50 - loss: 75.6676 - precision: 0.0000e+00 - recall: 0.0000e+00 - f1: 0.0000e+00 - 30s/step
...
step 49/50 - loss: 13.4652 - precision: 0.2067 - recall: 0.2224 - f1: 0.2143 - 29s/step
step 50/50 - loss: 4.7774 - precision: 0.2127 - recall: 0.2292 - f1: 0.2206 - 29s/step
save checkpoint at /home/aistudio/results/0
Eval begin...
step 1/6 - loss: 9.7892 - precision: 0.5659 - recall: 0.6170 - f1: 0.5903 - 22s/step
...
step 6/6 - loss: 20.2966 - precision: 0.5668 - recall: 0.6114 - f1: 0.5882 - 21s/step
Eval samples: 192
save checkpoint at /home/aistudio/results/final
```

C.5 Model Evaluation

Call model.evaluate to check the sequence labeling model's metrics on the test set (test.txt).

```python
model.evaluate(eval_data=test_loader, log_freq=1)
```

```
{'loss': [11.092163], 'precision': 0.5285481239804242, 'recall': 0.5654450261780105, 'f1': 0.5463743676222597}
```

C.6 Prediction

With the trained model we can run prediction on data without labels (here we reuse the test set test.txt) to obtain the model's predictions.

```python
def parse_decodes(ds, decodes, lens, label_vocab):
    # Flatten per-batch predictions and lengths
    decodes = [x for batch in decodes for x in batch]
    lens = [x for batch in lens for x in batch]
    id_label = dict(zip(label_vocab.values(), label_vocab.keys()))

    outputs = []
    for idx, end in enumerate(lens):
        sent = ds.data[idx][0][:end]
        tags = [id_label[x] for x in decodes[idx][:end]]
        sent_out = []
        tags_out = []
        words = ""
        for s, t in zip(sent, tags):
            # A "-B" tag or "O" closes the previous chunk and starts a new one
            if t.endswith('-B') or t == 'O':
                if len(words):
                    sent_out.append(words)
                tags_out.append(t.split('-')[0])
                words = s
            else:
                words += s
        if len(sent_out) < len(tags_out):
            sent_out.append(words)
        outputs.append(''.join(
            [str((s, t)) for s, t in zip(sent_out, tags_out)]))
    return outputs
```
```python
outputs, lens, decodes = model.predict(test_data=test_loader)
preds = parse_decodes(test_ds, decodes, lens, label_vocab)
print('\n'.join(preds[:5]))
```

```
Predict begin...
step 6/6 [==============================] - 10s/step
Predict samples: 192
('黑龙江省双鸭山市尖山', 'A1')('区八马路与东平行路交叉口北40米韦业涛', 'A4')('18600009172', 'T')
('广', 'A1')('西壮族自治区桂林市', 'A3')('雁山区', 'A4')('雁山镇西龙村老年活动中心', 'T')('17610348888', 'P')
('15652864561', 'T')('河南省开封市', 'A1')('顺', 'A3')('河回族区', 'A3')('顺河区公园路32号赵本山', 'A4')
('河', 'A1')('北省唐山市', 'A1')('玉田县', 'A3')('无终大巷159号', 'A4')('18614253058', 'T')('尚汉生', 'P')
('台湾台中市', 'A1')('北区北', 'A3')('区锦新街18号', 'A4')('18511226708', 'T')('蓟丽', 'P')
```

PART D. Advanced: Improving the Model with Pretrained Word Embeddings

In the baseline we obtained word representations with paddle.nn.Embedding, which is randomly initialized and learned from scratch on this small dataset.
Here we instead use the built-in TokenEmbedding from paddlenlp.embeddings, which warm-starts the embedding layer from pretrained Chinese word vectors and should therefore give better results.

```python
from paddlenlp.embeddings import TokenEmbedding  # EMB

# Free the memory used by the baseline model
del model
del preds
del network

class BiGRUWithCRF2(nn.Layer):
    def __init__(self,
                 emb_size,
                 hidden_size,
                 word_num,
                 label_num,
                 use_w2v_emb=True):
        super(BiGRUWithCRF2, self).__init__()
        if use_w2v_emb:
            # Warm-start from pretrained Chinese word vectors
            self.word_emb = TokenEmbedding(
                extended_vocab_path='./data/word.dic', unknown_token='OOV')
        else:
            self.word_emb = nn.Embedding(word_num, emb_size)
        self.gru = nn.GRU(emb_size,
                          hidden_size,
                          num_layers=2,
                          direction='bidirectional')
        self.fc = nn.Linear(hidden_size * 2, label_num + 2)  # BOS EOS
        self.crf = LinearChainCrf(label_num)
        self.decoder = ViterbiDecoder(self.crf.transitions)

    def forward(self, x, lens):
        embs = self.word_emb(x)
        output, _ = self.gru(embs)
        output = self.fc(output)
        _, pred = self.decoder(output, lens)
        return output, lens, pred

network = BiGRUWithCRF2(300, 300, len(word_vocab), len(label_vocab))
model = paddle.Model(network)
optimizer = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())
crf_loss = LinearChainCrfLoss(network.crf)
chunk_evaluator = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True)
model.prepare(optimizer, crf_loss, chunk_evaluator)

model.fit(train_data=train_loader,
          eval_data=dev_loader,
          epochs=2,
          save_dir='./results',
          log_freq=1)
model.evaluate(eval_data=test_loader)

outputs, lens, decodes = model.predict(test_data=test_loader)
preds = parse_decodes(test_ds, decodes, lens, label_vocab)
print('\n'.join(preds[:5]))
```

Results:

```
step  1/50 - loss: 106.8512 - precision: 0.0000e+00 - recall: 0.0000e+00 - f1: 0.0000e+00 - 40s/step
step  2/50 - loss: 76.1544 - precision: 0.0000e+00 - recall: 0.0000e+00 - f1: 0.0000e+00 - 46s/step
...
step 50/50 - loss: 1.0122 - precision: 0.4701 - recall: 0.5653 - f1: 0.5134 - 32s/step
save checkpoint at /home/aistudio/results/0
Eval begin...
step 1/6 - loss: 0.5040 - precision: 0.9005 - recall: 0.9149 - f1: 0.9077 - 39s/step
...
step 6/6 - loss: 11.3115 - precision: 0.9108 - recall: 0.9362 - f1: 0.9233 - 25s/step
Eval samples: 192
Epoch 2/2
step  1/50 - loss: 1.1930 - precision: 0.8077 - recall: 0.8750 - f1: 0.8400 - 29s/step
...
step 50/50 - loss: 0.0000e+00 - precision: 0.9292 - recall: 0.9539 - f1: 0.9413 - 30s/step
save checkpoint at /home/aistudio/results/1
Eval begin...
step 1/6 - loss: 0.0000e+00 - precision: 0.9531 - recall: 0.9734 - f1: 0.9632 - 24s/step
...
step 6/6 - loss: 1.4915 - precision: 0.9680 - recall: 0.9790 - f1: 0.9735 - 23s/step
Eval samples: 192
save checkpoint at /home/aistudio/results/final
```

Test-set evaluation:

```
Eval begin...
step 6/6 - loss: 0.0000e+00 - precision: 0.9463 - recall: 0.9686 - f1: 0.9573 - 26s/step
Eval samples: 192
{'loss': [0.0], 'precision': 0.9462915601023018, 'recall': 0.9685863874345549, 'f1': 0.9573091849935316}
```

The model's f1 score improves dramatically over the baseline (on the test set, from 0.5464 to 0.9573).

```
Predict begin...
step 6/6 [==============================] - 14s/step
Predict samples: 192
('黑龙江省', 'A1')('双鸭山市', 'A2')('尖山区', 'A3')('八马路与东平行路交叉口北40米', 'A4')('韦业涛', 'P')('18600009172', 'T')
('广西壮族自治区', 'A1')('桂林市', 'A2')('雁山区', 'A3')('雁山镇西龙村老年活动中心', 'A4')('17610348888', 'T')('羊卓卫', 'P')
('15652864561', 'T')('河南省', 'A1')('开封市', 'A2')('顺河回族区', 'A3')('顺河区公园路32号', 'A4')('赵本山', 'P')
('河北省', 'A1')('唐山市', 'A2')('玉田县', 'A3')('无终大巷159号', 'A4')('18614253058', 'T')('尚汉生', 'P')
('台湾', 'A1')('台中市', 'A2')('北区', 'A3')('北区锦新街18号', 'A4')('18511226708', 'T')('蓟丽', 'P')
```