Machine learning: How to set up human review for NLP-based entity recognition models

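Organizations across industries hold large amounts of unstructured data that decision-making teams evaluate to obtain entity-based insights. You may also want to add your own proprietary business entity types, such as proprietary part numbers or industry-specific terms. To create a model based on natural language processing (NLP), however, you first need to label data for these specific entities.

Amazon SageMaker Ground Truth makes it easy to build highly accurate training datasets for machine learning (ML), and Amazon Comprehend quickly selects the right algorithm and parameters for the model training job. Finally, Amazon Augmented AI (Amazon A2I) lets you audit, review, and augment the resulting predictions.

This post covers how to build a labeled dataset for custom entities with the Ground Truth named entity recognition (NER) labeling feature, how to train a custom entity recognizer with Amazon Comprehend, and how to use the human review mechanism in Amazon A2I to review Amazon Comprehend predictions whose confidence falls below a given threshold.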

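We use a sample Amazon SageMaker Jupyter notebook to complete the following steps in this walkthrough:

Preprocess the input files.
Create a Ground Truth NER labeling job.
Train an Amazon Comprehend custom entity recognizer model.
Set up a human review loop with Amazon A2I to catch low-confidence predictions.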

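Prerequisites

Before you start, set up the Jupyter notebook with the following steps:

Create a notebook instance in Amazon SageMaker.

Make sure the Amazon SageMaker notebook has the AWS Identity and Access Management (IAM) roles and permissions listed in the prerequisites section of the notebook.

When the notebook is active, choose Open Jupyter.
On the Jupyter dashboard, choose New, then choose Terminal.
In the terminal, enter the following code: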

cd SageMaker
git clone https://github.com/aws-samples/augmentedai-comprehendner-groundtruth

In the augmentedai-comprehendner-groundtruth folder, choose SageMakerGT-ComprehendNER-A2I-Notebook.ipynb to open the notebook.

You can now run the following steps in the notebook cells.

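Preprocessing the input files

In this use case, you are reviewing chat messages or submitted tickets and want to find out whether they relate to AWS products. We use the NER labeling feature in Ground Truth to label SERVICE or VERSION entities in the input messages. We then train an Amazon Comprehend custom entity recognizer to recognize those entities in text such as tweets or ticket comments.

The sample dataset is available at data/rawinput/aws-service-offerings.txt in the GitHub repo. The following screenshot shows an example of the dataset used in this walkthrough.

Preprocessing the file generates the following files:

inputs.csv – Use this file to generate the input manifest file for Ground Truth NER labeling.
train.csv and test.csv – Use these files as input for custom entity training. You can find these files in the Amazon Simple Storage Service (Amazon S3) bucket.

For details on how the datasets are generated, see Steps 1a and 1b in the notebook.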
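The notebook's own preprocessing logic is not reproduced in this post. As a rough illustration only, the following sketch shows one plausible way to split one-document-per-line text into training and test sets. The function name split_documents, the 80/20 ratio, and the fixed seed are assumptions made for this example, not the notebook's actual method.

```python
import random


def split_documents(lines, test_fraction=0.2, seed=42):
    """Split one-document-per-line text into (train, test) lists.

    Hypothetical helper: the ratio and seed are illustrative choices.
    """
    # Drop blank lines and surrounding whitespace.
    docs = [ln.strip() for ln in lines if ln.strip()]
    # Shuffle with a seeded generator so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(docs)
    # Hold out at least one document for testing.
    n_test = max(1, int(len(docs) * test_fraction))
    return docs[n_test:], docs[:n_test]
```

With ten input documents this yields eight training documents and two test documents, and repeated calls with the same seed return the same split.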

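Creating a Ground Truth NER labeling job

The goal is to annotate and label the sentences in the input file as belonging to our custom entities. In this section, you complete the following steps:

Create the manifest file that Ground Truth needs.
Set up the labeling workforce.
Create the labeling job.
Start the labeling job and verify its output.

Creating a manifest file

We use the inputs.csv file generated during preprocessing to create the manifest file that the NER labeling feature needs. We name the generated manifest file prefix+-text-input.manifest and use it for data labeling when we create the Ground Truth job. See the following code: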

# Create and upload the input manifest by appending a source tag to each of the lines in the input text file.
# Ground Truth uses the manifest file to determine labeling tasks.
# prefix, BUCKET, s3res (S3 resource), and s3 (S3 client) are defined earlier in the notebook.
manifest_name = prefix + '-text-input.manifest'
# Remove existing files with the same name to avoid duplicate entries (notebook shell command).
!rm *.manifest
s3bucket = s3res.Bucket(BUCKET)
with open(manifest_name, 'w') as f:
    for fn in s3bucket.objects.filter(Prefix=prefix + '/input/'):
        fn_obj = s3res.Object(BUCKET, fn.key)
        for line in fn_obj.get()['Body'].read().splitlines():
            # One JSON object per line, terminated by a newline.
            f.write('{"source":"' + line.decode('utf-8') + '"}\n')
s3.upload_file(manifest_name, BUCKET, prefix + "/manifest/" + manifest_name)

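The NER labeling job requires each line of the input manifest to be in the format {"source": "embedded text"}. The following screenshot shows the contents of the input.manifest file generated from inputs.csv.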
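Building each manifest line by string concatenation produces invalid JSON when the text itself contains double quotes or backslashes. The following is a minimal sketch of a safer variant that serializes each line with json.dumps. The helper name to_manifest_lines and the sample sentences are made up for illustration and do not come from the notebook.

```python
import json


def to_manifest_lines(text_lines):
    """Yield one {"source": ...} JSON object per non-empty input line."""
    for text in text_lines:
        text = text.strip()
        if text:
            # json.dumps escapes quotes and backslashes inside the text.
            yield json.dumps({"source": text})


# Illustrative input; the second entry is blank and is skipped.
sample = ['Amazon EC2 is a "compute" service', '', 'AWS Lambda supports Python 3.8']
manifest = "\n".join(to_manifest_lines(sample)) + "\n"
print(manifest)
```

Each line of the resulting string can be parsed back with json.loads, which is a quick way to check a manifest before uploading it.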

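Creating a private labeling workforce

In Ground Truth, we use a private workforce to create a labeled dataset.

You can create the private workforce on the Amazon SageMaker console. For instructions, see the section on creating a private work team in Developing NER models with Amazon SageMaker Ground Truth and Amazon Comprehend.

Alternatively, follow the step-by-step instructions in the notebook.

In this walkthrough, we use the same private workforce to label and augment low-confidence data with Amazon A2I after custom entity training is complete.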

Creating a labeling job

The next step is to create the NER labeling job. This post recaps the key steps. For more information, see Adding a data labeling workflow for named entity recognition with Amazon SageMaker Ground Truth.

On the Amazon SageMaker console, under Ground Truth, choose Labeling jobs.
Choose Create labeling job.
For Job name, enter a job name.
For Input dataset location, enter the Amazon S3 location of the input manifest file you created (s3://bucket/path-to-your-manifest.json).
For Output Dataset Location, enter an S3 bucket with an output prefix (for example, s3://bucket-name/output).
For IAM role, choose Create a new Role.
Choose Any S3 Bucket.
Choose Create.
For Task category, choose Text.
Choose Named entity recognition.

Choose Next.
For Worker type, choose Private.
For Private Teams, choose the team you created.
In the Named Entity Recognition Labeling Tool section, for Enter a brief description of the task, enter: Highlight the word or group of words and select the corresponding most appropriate label from the right.
In the Instructions box, enter: Your labeling will be used to train an ML model for predictions. Please think carefully on the most appropriate label for the word selection. Remember to label at least 200 annotations per label type.
Choose Bold Italics.
For Labels, enter the label names you want to display to your workforce.
Choose Create.

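Starting the labeling job

The workers (or you, if you are acting as the worker) receive an email with login instructions.

Choose the URL provided and enter your user name and password.

You are directed to the labeling task UI.

Complete the labeling task by choosing labels for groups of words.
Choose Submit.

After you have labeled all the entries, the UI exits automatically.
To check the job status, on the Amazon SageMaker console, under Ground Truth, choose Labeling jobs.
Wait until the job status shows Complete.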
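The job status can also be checked programmatically instead of on the console. The following sketch polls the SageMaker DescribeLabelingJob operation until the job reaches a terminal state. The function name, polling interval, and retry limit are illustrative assumptions; sm_client is expected to be a boto3 SageMaker client that you create yourself.

```python
import time

# Terminal values of LabelingJobStatus; other values mean the job is still running.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_labeling_job(sm_client, job_name, poll_seconds=60, max_polls=120):
    """Poll a Ground Truth labeling job until it finishes; return the final status."""
    for _ in range(max_polls):
        response = sm_client.describe_labeling_job(LabelingJobName=job_name)
        status = response["LabelingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Labeling job {job_name} did not finish in time")


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   status = wait_for_labeling_job(boto3.client("sagemaker"), "my-ner-labeling-job")
```

Passing the client in as an argument keeps the helper easy to test with a stub and avoids tying it to one Region or credential setup.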

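Verifying the annotation output

To verify the annotation output, open your S3 bucket and navigate to <S3 Bucket Name>/output/<labeling-job-name>/manifests/output/output.manifest. You can review the manifest file that Ground Truth created there. The following screenshot shows sample entries from this walkthrough.

Training a custom entity model

We can now use the annotated dataset, that is, the output.manifest file that Ground Truth created, to train a custom entity recognizer. This section walks through the steps in the notebook.

Processing the annotated dataset

You can provide labels for Amazon Comprehend custom entities through an entity list or through annotations. In this post, we use the annotations generated by the Ground Truth labeling job. You need to convert the annotated output.manifest file into the following CSV format: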

File, Line, Begin Offset, End Offset, Type
documents.txt, 0, 0, 11, VERSION

Run the following code in the notebook to generate the annotations.csv file:

# Read the output manifest json and convert into a csv format as expected by Amazon Comprehend Custom Entity Recognizer
import json
import csv

# This is the file written by the format conversion code below.
csvout = 'annotations.csv'

# labeling_job_name, fntrain (training file name), and upload_to_s3 are defined earlier in the notebook.
with open(csvout, 'w', encoding="utf-8", newline='') as nf:
    csv_writer = csv.writer(nf)
    csv_writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    with open("data/groundtruth/output.manifest", "r") as fr:
        # Each manifest line is one JSON object describing one labeled document.
        for num, line in enumerate(fr.readlines()):
            lj = json.loads(line)
            #print(str(lj))
            if lj and labeling_job_name in lj:
                for ent in lj[labeling_job_name]['annotations']['entities']:
                    csv_writer.writerow([fntrain, num, ent['startOffset'], ent['endOffset'], ent['label'].upper()])

s3_annot_key = "output/" + labeling_job_name + "/comprehend/" + csvout
upload_to_s3(s3_annot_key, csvout)

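The following screenshot shows the contents of the file.

Setting up a custom entity recognizer

This post uses the API in its examples, but you can also create the recognizer and the batch analysis job on the Amazon Comprehend console. For instructions, see Build a custom entity recognizer using Amazon Comprehend.

Enter the following code. For s3_train_channel, use the train.csv file generated in the preprocessing step to train the recognizer. For s3_annot_channel, use annotations.csv as the labels to train your custom entity recognizer.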

custom_entity_request = {
    "Documents": {
        "S3Uri": s3_train_channel
    },
    "Annotations": {
        "S3Uri": s3_annot_channel
    },
    "EntityTypes": [
        {"Type": "SERVICE"},
        {"Type": "VERSION"}
    ]
}

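Create the entity recognizer with CreateEntityRecognizer. The recognizer is trained with a minimal number of training samples in order to produce some of the low-confidence predictions needed for the Amazon A2I workflow. See the following code: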

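import datetime

# Epoch-seconds timestamp; the "%s" format code is platform-specific.
id = str(datetime.datetime.now().strftime("%s"))
# comprehend, prefix, and role are defined earlier in the notebook.
create_custom_entity_response = comprehend.create_entity_recognizer(
    RecognizerName = prefix + "-CER",
    DataAccessRoleArn = role,
    InputDataConfig = custom_entity_request,
    LanguageCode = "en"
)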

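When the entity recognizer job is complete, you get a recognizer with performance scores. As mentioned earlier, we trained the recognizer with a minimal number of training samples to produce some of the low-confidence predictions needed for the Amazon A2I workflow. You can find these metrics on the Amazon Comprehend console, as shown in the following screenshot.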
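Training status and the performance scores can also be read through the API. The following sketch polls the Comprehend DescribeEntityRecognizer operation until training ends and then pulls out the evaluation metrics. The helper names, polling interval, and retry limit are assumptions for this example; comprehend_client is expected to be a boto3 Comprehend client.

```python
import time

# Statuses after which the recognizer no longer changes.
FINAL_STATUSES = {"TRAINED", "TRAINED_WITH_WARNING", "IN_ERROR", "STOPPED"}


def wait_for_recognizer(comprehend_client, recognizer_arn, poll_seconds=60, max_polls=240):
    """Poll until training ends; return the EntityRecognizerProperties dict."""
    for _ in range(max_polls):
        response = comprehend_client.describe_entity_recognizer(
            EntityRecognizerArn=recognizer_arn
        )
        props = response["EntityRecognizerProperties"]
        if props["Status"] in FINAL_STATUSES:
            return props
        time.sleep(poll_seconds)
    raise TimeoutError("Recognizer training did not finish in time")


def summarize_metrics(props):
    """Return precision, recall, and F1 from the recognizer metadata, if present."""
    metrics = props.get("RecognizerMetadata", {}).get("EvaluationMetrics", {})
    return {name: metrics.get(name) for name in ("Precision", "Recall", "F1Score")}


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   arn = create_custom_entity_response["EntityRecognizerArn"]
#   props = wait_for_recognizer(boto3.client("comprehend"), arn)
#   print(props["Status"], summarize_metrics(props))
```

Checking the returned status before reading the metrics matters, because a recognizer that ended in IN_ERROR has no usable scores.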

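Create a batch entity detection analysis job to detect entities across a large number of documents.

Use the Amazon Comprehend StartEntitiesDetectionJob operation to detect custom entities in your documents. For instructions on creating a real-time analysis endpoint with a custom entity recognizer, see the post on launching Amazon Comprehend custom entity recognition real-time endpoints.

To use the EntityRecognizerArn for custom entity recognition, you must give the detection job access to the recognizer. The ARN is returned in the response to the CreateEntityRecognizer operation.

Run the custom entity detection job by running the following notebook cell, which makes predictions on the test dataset created in the preprocessing step: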

s3_test_channel = 's3://{}/{}'.format(BUCKET, s3_test_key)
s3_output_test_data = 's3://{}/{}'.format(BUCKET, "output/testresults/")

# jobArn holds the recognizer ARN and is set earlier in the notebook.
test_response = comprehend.start_entities_detection_job(
    InputDataConfig={
        'S3Uri': s3_test_channel,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={'S3Uri': s3_output_test_data},
    DataAccessRoleArn=role,
    JobName='a2i-comprehend-gt-blog',
    EntityRecognizerArn=jobArn,
    LanguageCode='en'
)

The following screenshot shows the test results from this walkthrough.
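The human loop code later in this post iterates over records with the keys ENTITY, BEGIN_OFFSET, END_OFFSET, CONFIDENCE_SCORE, and ORIGINAL_TEXT, but the post does not show how those records are built. The following sketch is one possible way to derive them from the JSON Lines output of the detection job. It assumes each output line carries a Line index and an Entities list, and that scores are rescaled from the 0 to 1 range to percentages so they can be compared with a threshold such as 90. The function name and the TYPE key are hypothetical.

```python
import json


def flatten_detections(output_lines, documents):
    """Turn detection output lines into one flat record per detected entity.

    output_lines: JSON strings, one per analyzed document.
    documents: the original test documents, indexed by line number.
    """
    records = []
    for raw in output_lines:
        result = json.loads(raw)
        original = documents[result["Line"]]
        for ent in result.get("Entities", []):
            records.append({
                "ORIGINAL_TEXT": original,
                "ENTITY": ent["Text"],
                "TYPE": ent["Type"],
                "BEGIN_OFFSET": ent["BeginOffset"],
                "END_OFFSET": ent["EndOffset"],
                # Assumption: convert the 0-1 score to a percentage.
                "CONFIDENCE_SCORE": round(ent["Score"] * 100, 2),
            })
    return records
```

The detection job writes its results as a compressed archive in the S3 output location, so the lines passed to this helper would come from the extracted output file.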

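Setting up a human review loop

In this section, we set up a human review loop for low-confidence detections in Amazon A2I, which includes the following steps:

Choose the workforce.
Create a human task UI.
Create a worker task template creator function.
Create the flow definition.
Check the human loop status and wait for reviewers to complete the task.

Choosing the workforce

In this post, we use the private workforce created for the Ground Truth labeling job. Use the workforce ARN to set up the workforce for Amazon A2I.

Creating a human task UI

Create a human task UI resource from a UI template written in Liquid HTML. This template is used whenever a human loop is required.

The following sample code has been tested and is compatible with Amazon Comprehend entity detection: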

template = """
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<style>
  .highlight {
    background-color: yellow;
  }
</style>

<crowd-entity-annotation
  name="crowd-entity-annotation"
  header="Highlight parts of the text below"
  labels="[{'label': 'service', 'fullDisplayName': 'Service'}, {'label': 'version', 'fullDisplayName': 'Version'}]"
  text="{{ task.input.originalText }}"
>
  <full-instructions header="Named entity recognition instructions">
    <ol>
      <li><strong>Read</strong> the text carefully.</li>
      <li><strong>Highlight</strong> words, phrases, or sections of the text.</li>
      <li><strong>Choose</strong> the label that best matches what you have highlighted.</li>
      <li>To <strong>change</strong> a label, choose highlighted text and select a new label.</li>
      <li>To <strong>remove</strong> a label from highlighted text, choose the X next to the abbreviated label name on the highlighted text.</li>
      <li>You can select all of a previously highlighted text, but not a portion of it.</li>
    </ol>
  </full-instructions>

  <short-instructions>
    Select the word or words in the displayed text corresponding to the entity, label it and click submit
  </short-instructions>

  <div id="recognizedEntities" style="margin-top: 20px">
    <h3>Label the Entity below in the text above</h3>
    <p>{{ task.input.entities }}</p>
  </div>
</crowd-entity-annotation>

<script>
  function highlight(text) {
    var inputText = document.getElementById("inputText");
    var innerHTML = inputText.innerHTML;
    var index = innerHTML.indexOf(text);
    if (index >= 0) {
      innerHTML = innerHTML.substring(0,index) + "<span class='highlight'>" + innerHTML.substring(index,index+text.length) + "</span>" + innerHTML.substring(index + text.length);
      inputText.innerHTML = innerHTML;
    }
  }

  document.addEventListener('all-crowd-elements-ready', () => {
    document
      .querySelector('crowd-entity-annotation')
      .shadowRoot
      .querySelector('crowd-form')
      .form
      .appendChild(recognizedEntities);
  });
</script>
"""

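Creating a worker task template creator function

This function is a higher-level abstraction over the Amazon SageMaker package method that creates the human task UI for the review workflow. See the following code: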

def create_task_ui():
    '''
    Creates a Human Task UI resource.

    Returns:
    struct: HumanTaskUiArn
    '''
    response = sagemaker.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response

# Task UI name - this value is unique per account and region. You can also provide your own value here.
taskUIName = prefix + '-ui'

# Create task UI
humanTaskUiResponse = create_task_ui()
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']
print(humanTaskUiArn)

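Creating the flow definition

You can specify the following in the flow definition:

The workforce that your tasks are sent to
The labeling instructions that your workforce receives

This post uses the API, but you can also create the workflow definition on the Amazon A2I console.

For more information, see the documentation on how to create a flow definition.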
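The post does not list the API call that creates the flow definition. The following is a minimal sketch built on the SageMaker CreateFlowDefinition operation, assuming a single reviewer per task. The helper names, the task title and description strings, and the TaskCount value are illustrative choices rather than the notebook's settings. sm_client is expected to be a boto3 SageMaker client.

```python
def build_flow_definition_request(name, role_arn, workteam_arn, task_ui_arn, output_s3_uri):
    """Assemble the keyword arguments for create_flow_definition."""
    return {
        "FlowDefinitionName": name,
        "RoleArn": role_arn,
        "HumanLoopConfig": {
            # The private work team that receives the review tasks.
            "WorkteamArn": workteam_arn,
            # The human task UI created from the Liquid template.
            "HumanTaskUiArn": task_ui_arn,
            # Illustrative values: one reviewer per task, short task texts.
            "TaskCount": 1,
            "TaskTitle": "Review detected entities",
            "TaskDescription": "Label the entity in the displayed text",
        },
        # Reviewed results are written under this S3 prefix.
        "OutputConfig": {"S3OutputPath": output_s3_uri},
    }


def create_flow_definition(sm_client, **settings):
    """Create the flow definition and return its ARN."""
    request = build_flow_definition_request(**settings)
    return sm_client.create_flow_definition(**request)["FlowDefinitionArn"]


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   flowDefinitionArn = create_flow_definition(
#       boto3.client("sagemaker"),
#       name="comprehend-ner-review", role_arn=role, workteam_arn=WORKTEAM_ARN,
#       task_ui_arn=humanTaskUiArn, output_s3_uri="s3://my-bucket/a2i-results")
```

Separating the request builder from the API call makes it possible to inspect the configuration before any resource is created.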

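To set the condition that triggers a human loop review, enter the following code (you can change CONFIDENCE_SCORE_THRESHOLD to adjust the confidence level below which human review is triggered):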

human_loops_started = []

import json
import uuid

CONFIDENCE_SCORE_THRESHOLD = 90

# data, a2i_runtime_client, and flowDefinitionArn are defined earlier in the notebook.
for line in data:
    print("Line is: " + str(line))
    begin_offset = line['BEGIN_OFFSET']
    end_offset = line['END_OFFSET']
    if line['CONFIDENCE_SCORE'] < CONFIDENCE_SCORE_THRESHOLD:
        # Each human loop needs a unique name.
        humanLoopName = str(uuid.uuid4())
        human_loop_input = {}
        human_loop_input['labels'] = line['ENTITY']
        human_loop_input['entities'] = line['ENTITY']
        human_loop_input['originalText'] = line['ORIGINAL_TEXT']
        start_loop_response = a2i_runtime_client.start_human_loop(
            HumanLoopName=humanLoopName,
            FlowDefinitionArn=flowDefinitionArn,
            HumanLoopInput={
                "InputContent": json.dumps(human_loop_input)
            }
        )
        print(human_loop_input)
        human_loops_started.append(humanLoopName)
        print(f'Score is less than the threshold of {CONFIDENCE_SCORE_THRESHOLD}')
        print(f'Starting human loop with name: {humanLoopName} \n')
    else:
        print('No human loop created. \n')

Checking the human loop status and waiting for reviewers to complete the task

To check the status of the human loops, enter the following code:

completed_human_loops = []
for human_loop_name in human_loops_started:
    resp = a2i_runtime_client.describe_human_loop(HumanLoopName=human_loop_name)
    print(f'HumanLoop Name: {human_loop_name}')
    print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
    print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
    print('\n')
    # Keep only the loops that reviewers have finished.
    if resp["HumanLoopStatus"] == "Completed":
        completed_human_loops.append(resp)
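Once a loop is completed, its results are stored as a JSON object at the S3 location reported in HumanLoopOutput. The following sketch shows one way to locate that object and pull out the reviewed entities. It assumes the output JSON holds a humanAnswers list whose answerContent is keyed by the name attribute of the crowd element in the template (crowd-entity-annotation). Both helper names are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


def extract_reviewed_entities(loop_output):
    """Collect the entities submitted by reviewers from one loop's output JSON."""
    entities = []
    for answer in loop_output.get("humanAnswers", []):
        content = answer.get("answerContent", {})
        # Assumption: the answer is keyed by the crowd element's name attribute.
        entities.extend(content.get("crowd-entity-annotation", {}).get("entities", []))
    return entities


# Example usage (requires boto3 and AWS credentials):
#   import boto3, json
#   s3_client = boto3.client("s3")
#   for resp in completed_human_loops:
#       bucket, key = split_s3_uri(resp["HumanLoopOutput"]["OutputS3Uri"])
#       body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
#       print(extract_reviewed_entities(json.loads(body)))
```

Each returned entity is expected to carry a label together with start and end offsets, which is the information needed to update the annotations file.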

Navigate to the private workforce portal (its URL is the output of cell 2 of the previous step in the notebook). See the following code:

workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

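The UI template is similar to the Ground Truth NER labeling feature. Amazon A2I displays the entity identified from the input text (that is, the low-confidence prediction). The worker can then update or validate the entity label as required and choose Submit.

This action generates an updated annotation that contains the offsets and entities highlighted by the human reviewer.

Cleaning up

To avoid incurring unnecessary charges, delete the resources when you finish this walkthrough, including the Amazon SageMaker notebook instance, the Amazon Comprehend custom entity recognizer, and any model artifacts in Amazon S3 that are no longer in use.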
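Part of the cleanup can be scripted. The following sketch deletes the custom entity recognizer and the objects stored under a given S3 prefix. The function name and its parameters are assumptions for this example: comprehend_client is a boto3 Comprehend client and s3_bucket is a boto3 S3 Bucket resource. The notebook instance itself is easier to stop and delete from the console, since stopping it from inside the notebook would end the running session.

```python
def cleanup(comprehend_client, s3_bucket, recognizer_arn, prefix):
    """Delete the custom recognizer and every S3 object under the prefix."""
    # Remove the trained custom entity recognizer.
    comprehend_client.delete_entity_recognizer(EntityRecognizerArn=recognizer_arn)
    # Batch-delete the walkthrough's files; this cannot be undone.
    s3_bucket.objects.filter(Prefix=prefix).delete()


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   cleanup(
#       boto3.client("comprehend"),
#       boto3.resource("s3").Bucket("my-bucket"),
#       recognizer_arn="arn:aws:comprehend:...",  # ARN of your recognizer
#       prefix="my-walkthrough-prefix/",
#   )
```

Check the prefix carefully before running the deletion, because every object whose key starts with it is removed.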

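Conclusion

This post demonstrated how to use Ground Truth NER to create annotations for Amazon Comprehend custom entity recognition. We also used Amazon A2I to update and improve the low-confidence predictions from Amazon Comprehend.

You can use the annotations that Amazon A2I generates to update the annotations file you created, and retrain the custom recognizer incrementally to keep improving the model's precision.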
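As a rough sketch of that feedback step, the following helper appends one reviewed document to the training documents and adds matching rows in the File, Line, Begin Offset, End Offset, Type layout used for annotations.csv. The function name, the in-memory lists, and the default file name are assumptions made for illustration. Reviewed entities are assumed to carry startOffset, endOffset, and label keys, as Ground Truth annotations do.

```python
def append_reviewed_document(train_docs, annotation_rows, text, entities, file_name="train.csv"):
    """Add a reviewed document and its entity rows; return the new line index."""
    # The new document becomes the last line of the training file.
    line_index = len(train_docs)
    train_docs.append(text)
    for ent in entities:
        annotation_rows.append([
            file_name,
            line_index,
            ent["startOffset"],
            ent["endOffset"],
            ent["label"].upper(),  # entity types are upper case, e.g. SERVICE
        ])
    return line_index
```

After the lists are updated, they would be written back to the training and annotation files and uploaded to Amazon S3 before a new recognizer is trained.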

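For video presentations, sample Jupyter notebooks, and more information about use cases such as document processing, content moderation, sentiment analysis, and text translation, see Amazon Augmented AI Resources. We look forward to seeing how you extend this solution in your own applications, and we welcome your feedback and suggestions.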