Natural Language Processing: Fine-Tuning Pre-trained NLP Models

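A step-by-step guide to fine-tuning pre-trained NLP models for any domain

Pre-trained NLP models are now widely available, which has greatly simplified applying deep learning to the interpretation of text data. However, although these models perform well on general tasks, they often lack adaptability to specific domains. This guide walks you through fine-tuning a pre-trained NLP model to improve its performance in a particular domain.

Although pre-trained NLP models such as BERT and the Universal Sentence Encoder (USE) capture the intricacies of language effectively, their performance in domain-specific applications can be limited by the scope of their training datasets. The limitation becomes apparent when analyzing relationships within a specific domain.

For example, when working with employment data, we expect a model to recognize that the roles "Data Scientist" and "Machine Learning Engineer" are closely related, or that "Python" and "TensorFlow" are strongly associated. Unfortunately, general-purpose models often miss these subtle relationships.

The table below shows the differences in similarity obtained from the base multilingual USE model:

To address this, we can fine-tune the pre-trained model with a high-quality, domain-specific dataset. This adaptation significantly improves the model's performance and precision on domain data and makes fuller use of what the NLP model can do.

When working with large pre-trained NLP models, it is advisable to deploy the base model first and to consider fine-tuning only if its performance falls short for the specific problem at hand.

This tutorial focuses on fine-tuning the Universal Sentence Encoder (USE) model with easily accessible open-source data.

An ML model can be fine-tuned with various strategies, such as supervised learning and reinforcement learning. This tutorial concentrates on a one-shot (few-shot) learning approach combined with a Siamese architecture for the fine-tuning process.

We use a Siamese neural network, a specific type of artificial neural network. The network uses shared weights to process two distinct input vectors simultaneously and compute comparable output vectors. Inspired by one-shot learning, this approach has proven particularly effective at capturing semantic similarity, although it may require longer training times and lacks probabilistic output.

A Siamese neural network creates an "embedding space" in which related concepts are positioned close together, which lets the model better discern semantic relationships.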

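Twin branches and shared weights: The architecture consists of two identical branches, each containing an embedding layer with shared weights. The two branches process two inputs at the same time, whether those inputs are similar or dissimilar.
Similarity and transformation: The inputs are converted to vector embeddings by the pre-trained NLP model. The architecture then computes the similarity between the vectors. The similarity score, which ranges from -1 to 1, quantifies the angular distance between the two vectors and serves as a measure of their semantic similarity.
Contrastive loss and learning: The model's learning is guided by a "contrastive loss", the difference between the expected output (the similarity score in the training data) and the computed similarity. This loss drives the adjustment of the model's weights to minimize the loss and improve the quality of the learned embeddings. In the code later in this tutorial, the loss is implemented as a mean squared error.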
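The three points above can be sketched in plain Python, without TensorFlow. This is a minimal illustration rather than the tutorial's implementation: `embed` stands in for the shared encoder, `toy_vectors` is a made-up lookup table, and the squared-error loss mirrors the mean squared error used in the model code later in the article.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_similarity(u, v):
    """Rescale the angle to [0, 1]: 1 = same direction, 0 = opposite."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))  # guard against rounding
    return 1.0 - math.acos(cos) / math.pi

def pair_loss(embed, left, right, target):
    """Squared error between the target score and the predicted similarity.

    The same `embed` function is applied to both inputs, which is what
    "shared weights" means in a Siamese setup.
    """
    predicted = angular_similarity(embed(left), embed(right))
    return (target - predicted) ** 2

# Hypothetical stand-in encoder: a fixed lookup table, not a real model.
toy_vectors = {
    "data scientist": [1.0, 0.0],
    "machine learning engineer": [0.0, 1.0],
}
loss = pair_loss(toy_vectors.get, "data scientist", "machine learning engineer", 0.9)
```

Here the two toy vectors are orthogonal, so the predicted similarity is 0.5 and the loss against a target of 0.9 is about 0.16. Training would adjust the encoder so that the two titles move closer together.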

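To fine-tune a pre-trained NLP model with this approach, the training data should consist of pairs of text strings, each pair accompanied by a similarity score.

The training data follows the format shown below: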
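The original sample is not reproduced here, so the following is only an illustrative sketch of such a file. The column names (`text_1`, `text_2`, `similarity`), the rows, and the assumption that scores lie between 0 and 1 are all made up for the example and are not taken from the tutorial's actual data.

```python
import csv
import io

# Made-up illustrative rows; the column names are assumptions, not the
# actual header of the tutorial's training_data.csv.
sample_csv = """text_1,text_2,similarity
data scientist,machine learning engineer,0.9
data scientist,registered nurse,0.1
"""

def read_pairs(handle):
    """Parse (text, text, score) rows and check each score lies in [0, 1]."""
    pairs = []
    for row in csv.DictReader(handle):
        score = float(row["similarity"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        pairs.append((row["text_1"], row["text_2"], score))
    return pairs

pairs = read_pairs(io.StringIO(sample_csv))
```

Validating the score range up front is a small design choice: a target outside the range the model can output would add loss that no amount of training could remove.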

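In this tutorial we use a dataset derived from the ESCO classification dataset, transformed so that similarity scores are generated from the relationships between its data elements.

Preparing the training data is a critical step in the fine-tuning process. We assume that you have access to the required data and a way to convert it to the specified format. Because this article focuses on demonstrating the fine-tuning process, we omit the details of how the data was generated from the ESCO dataset.

The ESCO dataset is freely available to developers as a foundation for applications that provide services such as autocomplete, suggestion systems, job search algorithms, and job matching algorithms. The dataset used in this tutorial has been transformed and is provided as a sample that may be used without restriction for any purpose.

Let's start by looking at the training data: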

import pandas as pd

# Read the CSV file into a pandas DataFrame
data = pd.read_csv("./data/training_data.csv")

# Print head
data.head()

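First, we set up the multilingual Universal Sentence Encoder as our baseline model. This baseline must be in place before the fine-tuning process begins.

In this tutorial, we use the STS benchmark and a sample similarity visualization as the metrics for evaluating the changes and improvements achieved through fine-tuning.

The STS benchmark dataset consists of English sentence pairs, each associated with a similarity score. During model training, we evaluate the model's performance on this benchmark set. The score recorded for each training run is the Pearson correlation between the predicted and actual similarity scores in the dataset.

These scores help confirm that the model retains a degree of generality as it is fine-tuned on our context-specific training data.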
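The `sts_benchmark` helper called in the code below is not defined in this article. As a rough idea of what such an evaluation involves, here is a plain-Python sketch under stated assumptions: `sts_pearson`, its arguments, and the toy vectors are hypothetical, and cosine similarity is assumed as the predicted score.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sts_pearson(embed, sentence_pairs, gold_scores):
    """Correlate predicted similarities with human-annotated scores.

    `embed` is any function mapping a sentence to a vector (hypothetical
    stand-in for the model under evaluation).
    """
    predicted = [cosine(embed(a), embed(b)) for a, b in sentence_pairs]
    return pearson(predicted, gold_scores)

# Toy check with a made-up lookup table in place of a real encoder.
toy = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
score = sts_pearson(toy.get, [("a", "a"), ("a", "b"), ("a", "c")], [5.0, 3.0, 0.0])
```

Because the Pearson correlation is invariant to scale, the gold scores do not need to be rescaled to match the range of the predicted similarities.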

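import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 (registers the text ops the multilingual USE model needs)

# Loads the Universal Sentence Encoder Multilingual module from TensorFlow Hub.
base_model_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"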
base_model = tf.keras.Sequential([
    hub.KerasLayer(base_model_url,
                   input_shape=[],
                   dtype=tf.string,
                   trainable=False)
])

# Defines a list of test sentences. These sentences represent various job titles.
test_text = ['Data Scientist', 'Data Analyst', 'Data Engineer',
             'Nurse Practitioner', 'Registered Nurse', 'Medical Assistant',
             'Social Media Manager', 'Marketing Strategist', 'Product Marketing Manager']

# Creates embeddings for the sentences in the test_text list. 
# The np.array() function is used to convert the result into a numpy array.
# The .tolist() function is used to convert the numpy array into a list, which might be easier to work with.
vectors = np.array(base_model.predict(test_text)).tolist()

# Calls the plot_similarity function to create a similarity plot.
# (plot_similarity and sts_benchmark are helper functions not shown in this article.)
plot_similarity(test_text, vectors, 90, "base model")

# Computes STS benchmark score for the base model
pearsonr = sts_benchmark(base_model)
print("STS Benchmark:" + str(pearsonr))
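Since `plot_similarity` is not defined in the article, the sketch below only computes the kind of pairwise matrix that such a plot would presumably display. It assumes cosine similarity between embeddings; the function name `similarity_matrix` and the toy vectors are hypothetical, and the actual helper's arguments (such as the `90`, possibly a label rotation) are not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_matrix(vectors):
    """Return the square matrix of pairwise cosine similarities."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Made-up 2-D vectors standing in for sentence embeddings.
labels = ["Data Scientist", "Data Analyst", "Registered Nurse"]
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
matrix = similarity_matrix(toy_vectors)

# Print one labeled row per title.
for label, row in zip(labels, matrix):
    print(f"{label:>18}: " + " ".join(f"{value:.2f}" for value in row))
```

With real embeddings, the matrix could be passed to a heat-map function from a plotting library to compare the base and tuned models side by side.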

The next step is to build the Siamese model architecture on top of the baseline model and fine-tune it with our domain-specific data.

import math
from tensorflow import keras

# Load the pre-trained sentence embedding model
embedding_layer = hub.load(base_model_url)

# Create a Keras layer from the loaded embedding model
shared_embedding_layer = hub.KerasLayer(embedding_layer, trainable=True)

# Define the inputs to the model
left_input = keras.Input(shape=(), dtype=tf.string)
right_input = keras.Input(shape=(), dtype=tf.string)

# Pass the inputs through the shared embedding layer
embedding_left_output = shared_embedding_layer(left_input)
embedding_right_output = shared_embedding_layer(right_input)

# Compute the cosine similarity between the embedding vectors
cosine_similarity = tf.keras.layers.Dot(axes=-1, normalize=True)(
    [embedding_left_output, embedding_right_output]
)

# Convert the cosine similarity to an angular similarity score in (0, 1).
# Clipping keeps acos away from +/-1, where its gradient is unbounded.
pi = tf.constant(math.pi, dtype=tf.float32)
clip_cosine_similarities = tf.clip_by_value(cosine_similarity, -0.99999, 0.99999)
acos_distance = 1.0 - (tf.acos(clip_cosine_similarities) / pi)

# Package the model
encoder = tf.keras.Model([left_input, right_input], acos_distance)

# Compile the model
encoder.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=0.00001,
        beta_1=0.9,
        beta_2=0.9999,
        epsilon=0.0000001,
        amsgrad=False,
        clipnorm=1.0,
        name="Adam",
    ),
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[
        tf.keras.metrics.MeanAbsoluteError(),
        tf.keras.metrics.MeanAbsolutePercentageError(),
    ],
)

# Print the model summary
encoder.summary()

Fit model
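The training code below calls `process_model_input`, which the article does not define. A minimal sketch of what it might do follows, assuming the training data has two text columns and one score column. The column names are hypothetical defaults and should be adjusted to the real CSV header.

```python
def process_model_input(data, left_col="text_1", right_col="text_2",
                        score_col="similarity"):
    """Split tabular training data into left texts, right texts, and scores.

    `data` may be a pandas DataFrame or any mapping from column name to a
    sequence of values. The column names are assumptions, not taken from
    the tutorial's dataset.
    """
    left = [str(value) for value in data[left_col]]
    right = [str(value) for value in data[right_col]]
    scores = [float(value) for value in data[score_col]]
    if not (len(left) == len(right) == len(scores)):
        raise ValueError("all three columns must have the same length")
    return left, right, scores

# Toy usage with a plain dict standing in for the DataFrame.
toy_data = {
    "text_1": ["data scientist", "data scientist"],
    "text_2": ["machine learning engineer", "registered nurse"],
    "similarity": [0.9, 0.1],
}
left_inputs, right_inputs, similarity = process_model_input(toy_data)
```

Keras generally expects tensors or NumPy arrays rather than Python lists, so the three outputs would likely need wrapping (for example with `tf.constant`) before being passed to `fit`.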

import os
from datetime import datetime

# Define early stopping callback
early_stop = keras.callbacks.EarlyStopping(monitor="loss", patience=3, min_delta=0.001)

# Define TensorBoard callback
logdir = os.path.join(".", "logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)

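# Model input (process_model_input is a helper not shown in this article;
# it returns the two text columns and the similarity scores)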
left_inputs, right_inputs, similarity = process_model_input(data)

# Train the encoder model
history = encoder.fit([left_inputs, right_inputs],
    similarity,
    batch_size=8,
    epochs=20,
    validation_split=0.2,
    callbacks=[early_stop, tensorboard_callback],
)
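To make the `patience=3, min_delta=0.001` settings above concrete, here is a simplified plain-Python approximation of patience-based early stopping on a monitored loss. It is a sketch of the general idea, not Keras's actual implementation, and the loss values are made up.

```python
def epochs_before_stop(losses, patience=3, min_delta=0.001):
    """Return how many epochs run before patience-based stopping triggers.

    An epoch counts as an improvement only if the loss drops by more than
    `min_delta` below the best value seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(losses)

# Made-up loss curve: progress stalls after the second epoch.
stalled = epochs_before_stop([0.5, 0.4, 0.3995, 0.3992, 0.3991, 0.39])
steady = epochs_before_stop([0.5, 0.4, 0.3, 0.2])
```

In the stalled curve, epochs three to five each improve by less than `min_delta`, so training stops after the fifth epoch, while the steadily improving curve runs to the end.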

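# Build a single-input model for inference. It reuses the loaded module,
# whose weights were updated during training.
# Define model input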
inputs = keras.Input(shape=[], dtype=tf.string)

# Pass the input through the embedding layer
embedding = hub.KerasLayer(embedding_layer)(inputs)

# Create the tuned model
tuned_model = keras.Model(inputs=inputs, outputs=embedding)

Now that we have the fine-tuned model, let's re-evaluate it and compare the results with those of the base model.

# Creates embeddings for the sentences in the test_text list. 
# The np.array() function is used to convert the result into a numpy array.
# The .tolist() function is used to convert the numpy array into a list, which might be easier to work with.
vectors = np.array(tuned_model.predict(test_text)).tolist()

# Calls the plot_similarity function to create a similarity plot.
plot_similarity(test_text, vectors, 90, "tuned model")

# Computes STS benchmark score for the tuned model
pearsonr = sts_benchmark(tuned_model)
print("STS Benchmark:" + str(pearsonr))

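Because the model was fine-tuned on a relatively small dataset, its STS benchmark score is comparable to that of the baseline model, which indicates that the tuned model still generalizes. The similarity visualization, however, shows higher similarity scores between similar titles and lower scores between dissimilar ones.

Fine-tuning a pre-trained NLP model for domain adaptation is a powerful technique that can improve its performance and precision in a specific context. By using a high-quality, domain-specific dataset and a Siamese neural network, we can strengthen the model's ability to capture semantic similarity.

Using the Universal Sentence Encoder (USE) model as an example, this tutorial provided a step-by-step guide to the fine-tuning process. We covered the theoretical framework, data preparation, baseline model evaluation, and the fine-tuning itself. The results show that fine-tuning is effective at sharpening in-domain similarity scores.

By following this approach and adapting it to your own domain, you can get more out of pre-trained NLP models and achieve better results on natural language processing tasks.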


2023-07-10
