On Algorithms: UIE Slim Meets Industrial Application Needs, Cuts Inference and Deployment Latency, and Improves Efficiency

Project link: just fork it to get started.
If any images are missing, see the original project.

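UIE's strong extraction capability comes with a correspondingly large compute cost. Some industrial application scenarios have strict performance requirements, and a model that cannot be compressed effectively cannot be used in practice. For this reason, the UIE Slim data distillation system was built on data distillation. The idea is to use data as a bridge that transfers the knowledge of the UIE model to a small closed-domain information extraction model, so that prediction becomes much faster at only a small cost in accuracy.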

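FasterTokenizer is an easy-to-use, powerful, cross-platform, high-performance text preprocessing library. It bundles several widely used Tokenizer implementations and supports text preprocessing for different NLP scenarios such as text classification, reading comprehension, and sequence labeling. Combined with the PaddleNLP Tokenizer module, it gives users efficient, general-purpose text preprocessing during both training and inference. The use_faster option enables FasterTokenizer, a high-performance tokenization operator implemented in C++, to speed up text preprocessing.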

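UIE data distillation in three steps

Step 1: Fine-tune the UIE model on the labeled data to obtain the Teacher Model.
Step 2: The user provides large-scale unlabeled data from the same source as the labeled data. Taskflow UIE is used to predict on this unlabeled data.
Step 3: Train the closed-domain Student Model on the labeled data plus the synthetic data obtained in Step 2.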

Results:

Test hardware:

Corresponding to the 1-point compute card:
V100 32GB
GPU: Tesla V100
Video Mem: 32GB
CPU: 4 Cores
RAM: 32GB
Disk: 100GB

| Model | Inference time (s) | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| uie-base | 68.61049008 | 0.69277 | 0.72327 | 0.70769 |
| uie-mini | 28.932519437 | 0.74138 | 0.54088 | 0.62545 |
| uie-micro | 26.36701917 | 0.74757 | 0.48428 | 0.58779 |
| uie-nano | 24.8937761 | 0.74286 | 0.49057 | 0.59091 |
| Distilled mini | 6.839258904 | 0.7732 | 0.75 | 0.76142 |
| Distilled micro | 6.776990 | 0.78261 | 0.72 | 0.75 |
| Distilled nano | 6.6231770 | 0.7957 | 0.74 | 0.76684 |
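
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The minimal Python sketch below recomputes it from the reported precision and recall values. The function name `f1_score` is chosen here for illustration and is not part of the project code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
reported = {
    "uie-base": (0.69277, 0.72327),
    "distilled mini": (0.7732, 0.75),
    "distilled nano": (0.7957, 0.74),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.5f}")
```

The recomputed values agree with the F1 column to the precision shown, which suggests the three metric columns were copied from the same evaluation run.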

Inference time on the tripled dataset:

| Model | Inference time | Speedup (×) |
| --- | --- | --- |
| UIE base | 203.95947s | 1 |
| UIE base + FasterTokenizer | 177.1798s | 1.15 |
| UIE distilled mini | 21.97979s | 9.28 |
| UIE distilled mini + FasterTokenizer | 20.1557s | 10.12 |
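
The speedup column is simply the UIE base time divided by each model's time. A minimal sketch of that arithmetic, using the timings from the table above (the variable names are illustrative only):

```python
# Inference times in seconds, copied from the table above.
times = {
    "UIE base": 203.95947,
    "UIE base + FasterTokenizer": 177.1798,
    "UIE distilled mini": 21.97979,
    "UIE distilled mini + FasterTokenizer": 20.1557,
}

baseline = times["UIE base"]

# Speedup factor relative to the plain UIE base run, rounded to two decimals.
speedups = {name: round(baseline / t, 2) for name, t in times.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```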

Archive: data.zip

inflating: ./data/unlabeled_data.txt
inflating: ./data/doccano_ext.json

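The sample data contains the following two parts:

| Name | Count |
| --- | --- |
| Labeled data in doccano format (doccano_ext.json) | 200 |
| Unlabeled data (unlabeled_data.txt) | 1277 |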

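For the detailed parameters and doccano annotation details, see these write-ups:

PaddleNLP UIE in practice: entity extraction tasks [ride-hailing data, express waybills]

[PaddleNLP UIE few-shot information extraction, advanced (part 2) [with a doccano walkthrough]](https://aistudio.baidu.com/ai…)

PaddleNLP UIE classification model [sentiment analysis and news classification as examples] (with a smart annotation scheme)

PaddleNLP UIE relation extraction model [executive relation extraction as an example]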

UIE models:

| Model | Architecture | Language | Size |
| --- | --- | --- | --- |
| uie-base (default) | 12-layers, 768-hidden, 12-heads | Chinese | 118M |
| uie-base-en | 12-layers, 768-hidden, 12-heads | English | 118M |
| uie-medical-base | 12-layers, 768-hidden, 12-heads | Chinese | |
| uie-medium | 6-layers, 768-hidden, 12-heads | Chinese | 75M |
| uie-mini | 6-layers, 384-hidden, 12-heads | Chinese | 27M |
| uie-micro | 4-layers, 384-hidden, 12-heads | Chinese | 23M |
| uie-nano | 4-layers, 312-hidden, 12-heads | Chinese | 18M |
| uie-m-large | 24-layers, 1024-hidden, 16-heads | Chinese, English | actual size 2G |
| uie-m-base | 12-layers, 768-hidden, 12-heads | Chinese, English | actual size 1G |

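About the actual model size:

The 118M listed for the base model is its parameter count. The same model can be stored at different precisions, such as float16 or float32. The download is about 450M (storage size) because the downloaded model is stored in float32, and 118M parameters × 4 bytes gives roughly that amount of storage.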
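
The size estimate above can be reproduced with a couple of lines. This is only back-of-the-envelope arithmetic: it assumes the checkpoint is dominated by the raw weights and ignores any file-format overhead. The helper name is made up for this example.

```python
def checkpoint_size_mib(num_params: int, bytes_per_param: int) -> float:
    """Approximate on-disk size of the raw weights in MiB."""
    return num_params * bytes_per_param / 2**20


params = 118_000_000  # parameter count quoted for uie-base

# float32 uses 4 bytes per parameter, float16 uses 2.
for dtype, nbytes in (("float32", 4), ("float16", 2)):
    print(f"{dtype}: about {checkpoint_size_mib(params, nbytes):.0f} MiB")
```

Storing the same weights in float16 would roughly halve the footprint, which is why the parameter count and the download size differ by a factor of about four here.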

!python finetune.py \
    --train_path "./data/train.txt" \
    --dev_path "./data/dev.txt" \
    --save_dir "./checkpoint" \
    --learning_rate 5e-6  \
    --batch_size 16 \
    --max_seq_len 512 \
    --num_epochs 10 \
    --model "uie-base" \
    --seed 1000 \
    --logging_steps 10 \
    --valid_steps 50 \
    --device "gpu"

Partial training log for the base model:

[2022-09-08 17:26:55,701] [INFO] - Evaluation precision: 0.69375, recall: 0.69811, F1: 0.69592
[2022-09-08 17:27:01,145] [INFO] - global step 260, epoch: 9, loss: 0.00172, speed: 1.84 step/s
[2022-09-08 17:27:06,448] [INFO] - global step 270, epoch: 9, loss: 0.00168, speed: 1.89 step/s
[2022-09-08 17:27:12,102] [INFO] - global step 280, epoch: 10, loss: 0.00165, speed: 1.77 step/s
[2022-09-08 17:27:17,607] [INFO] - global step 290, epoch: 10, loss: 0.00162, speed: 1.82 step/s
[2022-09-08 17:27:22,899] [INFO] - global step 300, epoch: 10, loss: 0.00159, speed: 1.89 step/s
[2022-09-08 17:27:26,577] [INFO] - Evaluation precision: 0.69277, recall: 0.72327, F1: 0.70769
[2022-09-08 17:27:26,577] [INFO] - best F1 performence has been updated: 0.69841 --> 0.70769

The user provides large-scale unlabeled data from the same source as the labeled data. Taskflow UIE is then used to predict on this unlabeled data.

%cd /home/aistudio/data_distill
!python data_distill.py \
    --data_path /home/aistudio/data \
    --save_dir student_data \
    --task_type relation_extraction \
    --synthetic_ratio 10 \
    --model_path /home/aistudio/checkpoint/model_best

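Configurable parameters:

data_path: path to the labeled data (doccano_ext.json) and the unlabeled text (unlabeled_data.txt).
model_path: path to the fine-tuned custom UIE model.
save_dir: path where the student model's training data is saved.
synthetic_ratio: controls the proportion of synthetic data. Maximum number of synthetic samples = synthetic_ratio * number of labeled samples.
task_type: the task type, one of entity_extraction, relation_extraction, event_extraction and opinion_extraction. Because this is closed-domain information extraction, the task type must be specified.
seed: random seed, default 1000.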

parser.add_argument("--data_path", default="../data", type=str, help="The directory for labeled data with doccano format and the large scale unlabeled data.")
parser.add_argument("--model_path", type=str, default="../checkpoint/model_best", help="The path of saved model that you want to load.")
parser.add_argument("--save_dir", default="./distill_task", type=str, help="The path of data that you wanna save.")
parser.add_argument("--synthetic_ratio", default=10, type=int, help="The ratio of labeled and synthetic samples.")
parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.")
parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization")
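
To make the synthetic_ratio rule concrete, here is a small sketch of the arithmetic using the sample data sizes from this project (200 labeled, 1277 unlabeled). The cap by the number of available unlabeled texts is an assumption made for this illustration, not something confirmed from data_distill.py, and both function names are hypothetical.

```python
def max_synthetic_samples(num_labeled: int, synthetic_ratio: int) -> int:
    """Upper bound stated in the parameter description:
    synthetic_ratio * number of labeled samples."""
    return synthetic_ratio * num_labeled


def synthetic_budget(num_labeled: int, num_unlabeled: int, synthetic_ratio: int) -> int:
    """Assumption: one synthetic sample per unlabeled text, so the usable
    amount cannot exceed the number of unlabeled texts available."""
    return min(max_synthetic_samples(num_labeled, synthetic_ratio), num_unlabeled)


print(max_synthetic_samples(200, 10))   # upper bound from the ratio
print(synthetic_budget(200, 1277, 10))  # bounded by the unlabeled texts
```

Under these assumptions, a ratio of 10 with the sample data would not be the limiting factor: the unlabeled file runs out first.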

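Configurable parameters:

model_path: path to the fine-tuned custom UIE model.
test_path: path to the test dataset.
label_maps_path: the student model's label dictionary.
batch_size: batch size. The description states a default of 8, while the argument parser excerpt below declares 16.
max_seq_len: maximum text length. The description states a default of 256, while the argument parser excerpt below declares 128.
task_type: the task type, one of entity_extraction, relation_extraction, event_extraction and opinion_extraction. Because this evaluates closed-domain information extraction, the task type must be specified.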

parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.")
parser.add_argument("--test_path", type=str, default=None, help="The path of test set.")
parser.add_argument("--encoder", default="ernie-3.0-base-zh", type=str, help="Select the pretrained encoder model for GP.")
parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.")
parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.")
parser.add_argument("--max_seq_len", type=int, default=128, help="The maximum total input sequence length after tokenization.")
parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction",

The backbone model can be swapped by following the examples below.

!python train.py \
    --task_type relation_extraction \
    --train_path student_data/train_data.json \
    --dev_path student_data/dev_data.json \
    --label_maps_path student_data/label_maps.json \
    --num_epochs 200 \
    --encoder ernie-3.0-mini-zh

# %cd /home/aistudio/data_distill
!python train.py \
    --task_type relation_extraction \
    --train_path student_data/train_data.json \
    --dev_path student_data/dev_data.json \
    --label_maps_path student_data/label_maps.json \
    --num_epochs 100 \
    --encoder ernie-3.0-mini-zh\
    --device "gpu"\
    --valid_steps 100\
    --logging_steps 10\
    --save_dir './checkpoint2'\
    --batch_size 16

Deploy the closed-domain information extraction model in one step with Taskflow. task_path is the path to the student model.

Demo test

from pprint import pprint
from paddlenlp import Taskflow

ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="checkpoint2/model_best/")  # The schema is fixed in closed-domain information extraction
pprint(ie("登革热 @后果 升高 ### 血清白蛋白程度 查看 后果 查看 在资源匮乏地区和富足地区，对有症状患者均应晚期检测。"))

[{'疾病': [{'end': 3,
          'probability': 0.9995957,
          'relations': {'实验室查看': [{'end': 21,
                                   'probability': 0.99892455,
                                   'relations': {},
                                   'start': 14,
                                   'text': '血清白蛋白程度'}],
                        '影像学查看': [{'end': 21,
                                   'probability': 0.99832386,
                                   'relations': {},
                                   'start': 14,
                                   'text': '血清白蛋白程度'}]},
          'start': 0,
          'text': '登革热'}]}]
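
The nested result above is convenient for inspection but awkward to store. As a minimal sketch, the helper below flattens one level of relations into (subject, relation, object, probability) tuples. The function name `iter_triples` is made up for this example, and it only assumes the dictionary layout shown in the output above.

```python
def iter_triples(results):
    """Yield (subject_text, relation, object_text, probability) tuples
    from a Taskflow-style result list (one dict per input text)."""
    for item in results:
        for subject_type, subjects in item.items():
            for subject in subjects:
                # Entities without relations simply produce no triples.
                for relation, objects in subject.get("relations", {}).items():
                    for obj in objects:
                        yield (subject["text"], relation, obj["text"], obj["probability"])


# Sample copied from the demo output above.
sample = [{'疾病': [{'end': 3,
                   'probability': 0.9995957,
                   'relations': {
                       '实验室查看': [{'end': 21, 'probability': 0.99892455,
                                   'relations': {}, 'start': 14, 'text': '血清白蛋白程度'}],
                       '影像学查看': [{'end': 21, 'probability': 0.99832386,
                                   'relations': {}, 'start': 14, 'text': '血清白蛋白程度'}]},
                   'start': 0,
                   'text': '登革热'}]}]

triples = list(iter_triples(sample))
for triple in triples:
    print(triple)
```

A flat list like this maps directly onto table rows, which may be easier to export than the nested dictionaries.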

from pprint import pprint
import json
import time

import pandas as pd
from paddlenlp.taskflow import Taskflow


def openreadtxt(file_name):
    # Read the file and return its lines as a list (one input text per line).
    with open(file_name, 'r', encoding='UTF-8') as file:
        return file.readlines()


# Timer 1: data and model loading
old_time = time.time()
data_input = openreadtxt('/home/aistudio/数据集/unlabeled_data.txt')

# The schema is fixed in closed-domain information extraction.
few_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="/home/aistudio/data_distill/checkpoint2/model_best", batch_size=32)

current_time = time.time()
print("Data and model loading time: " + str(current_time - old_time) + "s")

# Timer 2: model inference
old_time1 = time.time()
results = few_ie(data_input)
current_time1 = time.time()
print("Model inference time: " + str(current_time1 - old_time1) + "s")

# Timer 3: exporting the results
old_time3 = time.time()
test = pd.DataFrame(data=results)
test.to_csv('/home/aistudio/output/reslut.txt', sep='\t', index=False, header=False)  # local output file

# Alternative export, one JSON object per line. Mode "w+" overwrites an existing file; use "a" to append instead.
# with open("/home/aistudio/output/reslut.txt", "w+", encoding='UTF-8') as f:
#     for result in results:
#         line = json.dumps(result, ensure_ascii=False)  # json.dumps escapes non-ASCII by default; ensure_ascii=False keeps the Chinese text readable
#         f.write(line + "\n")
current_time3 = time.time()
print("Export time: " + str(current_time3 - old_time3) + "s")

# for idx, text in enumerate(data):
#     print('Data: {} \t Lable: {}'.format(text[0], results[idx]))
print("Results exported")
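
The script above repeats the same start/stop pattern for each of its three phases. One way to tidy that up, shown here only as an optional sketch, is a small context manager. `timed` is a hypothetical helper and not part of PaddleNLP, and the list comprehension is a stand-in for the real `few_ie(data_input)` call so that the snippet runs without a model.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=None):
    """Print how long the enclosed block took and optionally record it in `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if sink is not None:
            sink[label] = elapsed
        print(f"{label}: {elapsed:.4f}s")


timings = {}
with timed("Model inference time", timings):
    # Stand-in workload; replace with results = few_ie(data_input).
    results = [len(text) for text in ["a", "bb"]]
```

`time.perf_counter()` is used because it is a monotonic clock intended for measuring short durations, whereas `time.time()` can jump if the system clock is adjusted.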

**Distilled mini runtime:**

Data and model loading time: 0.8430757522583008s

Model inference time: 6.839258909225464s

Export time: 0.008304595947265625s

**Distilled nano runtime:**

Data and model loading time: 0.5164840221405029s

Model inference time: 6.6231770515441895s

Export time: 0.023623943328857422s

**Distilled micro runtime:**

Data and model loading time: 0.5323500633239746s

Model inference time: 6.77699007987976s

Export time: 0.04320549964904785s

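The schema of closed-domain UIE is fixed and can be viewed in label_maps.json:

0: "手术医治" (surgical treatment)
1: "实验室查看" (laboratory examination)
2: "影像学查看" (imaging examination)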

from pprint import pprint
import json
import time

import pandas as pd
from paddlenlp import Taskflow


def openreadtxt(file_name):
    # Read the file and return its lines as a list (one input text per line).
    with open(file_name, 'r', encoding='UTF-8') as file:
        return file.readlines()


# Timer 1: data and model loading
old_time = time.time()
data_input = openreadtxt('/home/aistudio/数据集/unlabeled_data.txt')

schema = {'疾病': ['手术医治', '实验室查看', '影像学查看']}
# Switch task_path to the checkpoint you want to measure.
# few_ie = Taskflow('information_extraction', schema=schema, batch_size=32, task_path='/home/aistudio/checkpoint_mini/model_best')
few_ie = Taskflow('information_extraction', schema=schema, batch_size=32, task_path='/home/aistudio/checkpoint_micro/model_best')

current_time = time.time()
print("Data and model loading time: " + str(current_time - old_time) + "s")

# Timer 2: model inference
old_time1 = time.time()
results = few_ie(data_input)
current_time1 = time.time()
print("Model inference time: " + str(current_time1 - old_time1) + "s")

# Timer 3: exporting the results
old_time3 = time.time()
test = pd.DataFrame(data=results)
test.to_csv('/home/aistudio/output/reslut.txt', sep='\t', index=False, header=False)  # local output file

# Alternative export, one JSON object per line. Mode "w+" overwrites an existing file; use "a" to append instead.
# with open("/home/aistudio/output/reslut.txt", "w+", encoding='UTF-8') as f:
#     for result in results:
#         line = json.dumps(result, ensure_ascii=False)  # json.dumps escapes non-ASCII by default; ensure_ascii=False keeps the Chinese text readable
#         f.write(line + "\n")
current_time3 = time.time()
print("Export time: " + str(current_time3 - old_time3) + "s")

# for idx, text in enumerate(data):
#     print('Data: {} \t Lable: {}'.format(text[0], results[idx]))
print("Results exported")

Switch task_path in the program above to load the corresponding model.

Recorded inference times:

**uie-nano**

Data and model loading time: 0.3770780563354492s

Model inference time: 24.893776178359985s

Export time: 0.01157689094543457s

**uie-micro**

Data and model loading time: 0.39632749557495117s

Model inference time: 26.367019176483154s

Export time: 0.012260198593139648s

**uie-mini**

Data and model loading time: 0.5642790794372559s

Model inference time: 28.93251943588257s

Export time: 0.01435089111328125s

**uie-base**

Data and model loading time: 1.4756040573120117s

Model inference time: 68.61049008369446s

Export time: 0.02205801010131836s

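use_faster: use FasterTokenizer, a high-performance tokenization operator implemented in C++, to speed up text preprocessing. The FasterTokenizer library must first be installed with pip install faster_tokenizer. Defaults to False. For more usage details, see the [FasterTokenizer documentation]

https://github.com/PaddlePadd…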

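Features

High performance. Because the core is implemented in C++, it is far faster than the usual Python Tokenizer implementations. On text classification tasks, FasterTokenizer reaches a speedup of up to 20x over the Python Tokenizer.
Cross-platform. FasterTokenizer runs on different system platforms. Windows x64, Linux x64, and MacOS 10.14+ are currently supported.
Multiple programming languages. FasterTokenizer can be used for development in C++ and Python.
Flexible. Users can build a Tokenizer that fits their needs by combining different FasterTokenizer components.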

FAQ

Q: I have already turned on use_faster=True in AutoTokenizer.from_pretrained. Why does text preprocessing performance seem unchanged?

A: There are three cases in which turning on use_faster=True may not improve performance:

faster_tokenizer is not installed. If use_faster is turned on without the faster_tokenizer library installed, PaddleNLP gives the following warning: "Can't find the faster_tokenizer package, please ensure install faster_tokenizer correctly."
The loaded Tokenizer type does not yet have a Faster version. Four Tokenizers currently have Faster versions: BERT, ERNIE, TinyBERT, and ERNIE-M. If use_faster is turned on for a Tokenizer without a Faster version, PaddleNLP gives the following warning: "The tokenizer XXX doesn't have the faster version. Please check the map paddlenlp.transformers.auto.tokenizer.FASTER_TOKENIZER_MAPPING_NAMES to see which faster tokenizers are currently supported."
The texts to tokenize are too short (for example, an average length below 5). In that case tokenization may not be the bottleneck of the whole preprocessing pipeline, so overall performance does not improve even with FasterTokenizer.
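
The first case in the FAQ can be checked before running anything. The sketch below uses only the Python standard library. It assumes the importable module name matches the pip package name `faster_tokenizer` given in the text, and `is_package_installed` is a helper made up for this example.

```python
import importlib.util


def is_package_installed(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if not is_package_installed("faster_tokenizer"):
    print("faster_tokenizer is not installed, so use_faster=True is unlikely to help.")
```

This only confirms that the package can be found. It says nothing about the other two cases (an unsupported Tokenizer type or very short texts).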

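Option 1: install paddlenlp directly into a specified path and then modify the corresponding files.
For details, see this PR:

Add use_faster flag for uie of taskflow.

Option 2: take the version already modified by the PR and pull it from GitHub. Reference link:

https://github.com/joey12300/…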

from pprint import pprint
import json
import time

import pandas as pd
from paddlenlp.taskflow import Taskflow


def openreadtxt(file_name):
    # Read the file and return its lines as a list (one input text per line).
    with open(file_name, 'r', encoding='UTF-8') as file:
        return file.readlines()


# Timer 1: data and model loading
old_time = time.time()
data_input = openreadtxt('/home/aistudio/数据集/unlabeled_data-Copy1.txt')

# Distilled student with FasterTokenizer enabled. The schema is fixed in closed-domain information extraction.
few_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="/home/aistudio/data_distill/checkpoint2/model_best", use_faster=True, batch_size=32)
# Distilled student without FasterTokenizer:
# few_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="/home/aistudio/data_distill/checkpoint2/model_best", batch_size=32)

# UIE base, with and without FasterTokenizer:
# schema = {'疾病': ['手术医治', '实验室查看', '影像学查看']}
# few_ie = Taskflow('information_extraction', schema=schema, batch_size=32, use_faster=True, task_path='/home/aistudio/checkpoint/model_best')
# few_ie = Taskflow('information_extraction', schema=schema, batch_size=32, task_path='/home/aistudio/checkpoint/model_best')

current_time = time.time()
print("Data and model loading time: " + str(current_time - old_time) + "s")

# Timer 2: model inference
old_time1 = time.time()
results = few_ie(data_input)
current_time1 = time.time()
print("Model inference time: " + str(current_time1 - old_time1) + "s")

# Timer 3: exporting the results
old_time3 = time.time()
test = pd.DataFrame(data=results)
test.to_csv('/home/aistudio/output/reslut.txt', sep='\t', index=False, header=False)  # local output file

# Alternative export, one JSON object per line. Mode "w+" overwrites an existing file; use "a" to append instead.
# with open("/home/aistudio/output/reslut.txt", "w+", encoding='UTF-8') as f:
#     for result in results:
#         line = json.dumps(result, ensure_ascii=False)  # json.dumps escapes non-ASCII by default; ensure_ascii=False keeps the Chinese text readable
#         f.write(line + "\n")
current_time3 = time.time()
print("Export time: " + str(current_time3 - old_time3) + "s")

# for idx, text in enumerate(data):
#     print('Data: {} \t Lable: {}'.format(text[0], results[idx]))
print("Results exported")

The data sample is enlarged to three times its original size: unlabeled_data-Copy1.txt

UIE base

Data and model loading time: 1.6006419658660889s

Model inference time: 203.95947885513306s

Export time: 0.07103896141052246s

UIE base + FasterTokenizer

Data and model loading time: 1.6196515560150146s

Model inference time: 177.17986011505127s

Export time: 0.07898902893066406s

UIE distilled mini

Data and model loading time: 0.8441095352172852s

Model inference time: 21.979790925979614s

Export time: 0.02339339256286621s

UIE distilled mini + FasterTokenizer

Data and model loading time: 0.7269768714904785s

Model inference time: 20.155770540237427s

Export time: 0.012202978134155273s


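1. The distilled UIE students perform about the same across the small networks, so pick one according to your needs. Results may be somewhat better on larger tasks.

2. uie-base and the other UIE models were only trained for 10 epochs here. Training longer should improve their performance.

3. The student model is usually one with a small parameter count. The distilled UIE runs inference over the whole schema in parallel, so it is much faster than UIE, especially when the schema is large or when tasks such as relation extraction need multi-stage inference.

1. FasterTokenizer acceleration is not yet supported in paddlenlp 2.4.0. For now the only way is to modify the source code following the PR.

2. For closed-domain UIE the schema is fixed and can be viewed in label_maps.json. Entity extraction, relation extraction, opinion extraction, and event extraction are currently supported. Sentence-level sentiment classification is not yet supported by distillation.

3. For faster inference, just swap the student model's backbone.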
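
The point about parallel schema inference can be made concrete with a rough count of forward passes per input text. The sketch below is a simplified cost model, not a measurement. It assumes the prompt-based UIE model needs one forward pass per schema prompt (one for the subject type, then one per extracted subject and relation type), while the distilled student scores all labels in a single pass. Both function names are made up for this illustration.

```python
def uie_prompt_passes(num_subjects_found: int, num_relations: int) -> int:
    """Assumed cost of prompt-based, multi-stage extraction for one text:
    stage 1 uses one prompt for the subject type,
    stage 2 uses one prompt per (extracted subject, relation type) pair."""
    return 1 + num_subjects_found * num_relations


def student_passes() -> int:
    """Assumed cost of the distilled student: all labels in one pass."""
    return 1


# Schema used in this project: one subject type with three relation types.
num_relations = 3
for subjects in (1, 2, 5):
    print(f"{subjects} subject(s): UIE {uie_prompt_passes(subjects, num_relations)} "
          f"passes vs student {student_passes()} pass")
```

Under this model the gap grows with the number of relation types and extracted subjects, which matches the remark that large schemas and relation extraction benefit most. The measured speedups also depend on model size, batching, and tokenization, so the pass count alone does not predict them.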

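Acknowledgements

Thanks to @linjieccc of the PaddleNLP team for the support. They accepted the issue and created the pull request: fix data distill for UIE #3231 https://github.com/PaddlePadd…

Add use_faster flag for uie of taskflow. #3194

Outlook:

Follow-up work will add more on FasterTokenizer and look into quantization, pruning, and NAS for the UIE model.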

