AI Film Critic: Build a Movie-Rating Bot with Hugging Face Models

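This article is a tutorial written by community member Jun Chen for the hackathon co-hosted by Baixing AI and Hugging Face. See today's second post to learn about and join the hackathon. The article contains many links that are not reproduced one by one here; see the rendered Notebook file for them.

As artificial intelligence and large models such as ChatGPT continue to surge in popularity, more and more individuals and entrepreneurs want to, and can, explore opportunities in this emerging field by building their own AI apps. If you have an idea, you can use open communities and resources to implement simple features that meet the needs of a specific domain or group of users.

Imagine a new movie has just been released and we are at home with friends, discussing it enthusiastically. These conversations are valuable review material, but few people will bother to log in to their Douban account to post the comments they have just made. What if a movie-review bot could automatically collect these comments, score them, and upload the result to a designated review site? Or, when we eat at a restaurant, what if we could just say a few words into our phone and have our rating uploaded automatically to Dianping? Let's see how to build such a small bot!

In this tutorial, we explore how to use Hugging Face resources to fine-tune a model and build a movie-rating bot. We show how to combine these resources so that your bot can summarize a review and give a rating. We will guide you through this fun project in plain language!

To keep the steps simple, we simplify the implementation of this movie-rating bot:

  1. The app takes typed text directly as input. If you are interested, you can look into adding voice input; Whisper to ChatGPT is a good example.
  2. The app does not automatically upload ratings to any particular website.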

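Step 1: Train the movie-review scoring model

First we need a model that can read a review and score it. This example fine-tunes DistilBERT on the IMDb dataset. The fine-tuned model predicts whether a movie review is positive or negative, and the app later converts that prediction into a rating out of five.

You can of course pick a different dataset to fine-tune on, or a different base model, to suit your needs. Hugging Face offers many options.

This task uses, directly or indirectly, the following model architectures: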

ALBERT, BART, BERT, BigBird, BigBird-Pegasus, BLOOM, CamemBERT, CANINE, ConvBERT, CTRL, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPT Neo, GPT-J, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LED, LiLT, Longformer, LUKE, MarkupLM, mBART, Megatron-BERT, MobileBERT, MPNet, MVP, Nezha, Nyströmformer, OpenAI GPT, OPT, Perceiver, PLBart, QDQBert, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, SqueezeBERT, TAPAS, Transformer-XL, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO

# Transformers installation
! pip install transformers datasets evaluate
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Before running this example, install the following libraries:

pip install transformers datasets evaluate

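We recommend logging in to your Hugging Face account so that you can easily upload and share the models you create. When prompted, enter your personal token. Your tokens are listed under Access Tokens in your Hugging Face account settings.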

from huggingface_hub import notebook_login

notebook_login()
Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful

Load the IMDb dataset

Start by loading the IMDb dataset from the 🤗 Datasets library:

from datasets import load_dataset

imdb = load_dataset("imdb")
Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...
Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.

Check that the data loaded successfully:

imdb["test"][0]
{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as they have to always say"Gene Roddenberry\'s Earth..." otherwise people would not continue watching. Roddenberry\'s ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.',
 'label': 0}

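This dataset has two fields:

  • text: the movie review.
  • label: 0 or 1. 0 means a negative review and 1 means a positive review.

Preprocess the input data

The next step is to load the DistilBERT tokenizer and create a preprocessing function that tokenizes text and truncates it so that no input is longer than DistilBERT's maximum input length: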

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

Use the 🤗 Datasets map function to apply the preprocessing function to the whole dataset. Setting batched=True speeds up map by processing several examples at once:

tokenized_imdb = imdb.map(preprocess_function, batched=True)
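To make the effect of batched=True and truncation more concrete, here is a minimal, library-free sketch. It assumes a toy whitespace tokenizer (`toy_tokenize`, a hypothetical helper, not DistilBERT's real tokenizer) and a tiny `MAX_LEN` standing in for the model's real limit. It only illustrates the shape of the data: with batching, the preprocessing function receives a dict of lists and returns a dict of lists.

```python
MAX_LEN = 5  # stand-in for the model's real maximum input length


def toy_tokenize(text, max_len=MAX_LEN):
    """Split on whitespace and truncate; a stand-in for a real tokenizer."""
    return text.lower().split()[:max_len]


def toy_preprocess(examples):
    """Batched preprocessing: `examples` maps column names to lists of values."""
    return {"tokens": [toy_tokenize(t) for t in examples["text"]]}


# One batch of two made-up reviews, laid out column by column.
batch = {
    "text": [
        "I loved this movie",
        "The plot was slow and the ending made no sense at all",
    ],
    "label": [1, 0],
}

processed = toy_preprocess(batch)
for tokens in processed["tokens"]:
    print(len(tokens), tokens)
```

The second review is cut to `MAX_LEN` tokens, which is what truncation=True does with the real tokenizer at the model's own limit.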

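Use DataCollatorWithPadding to build batches. Dynamically padding each batch to the length of its longest sample uses fewer resources than padding the whole dataset to the maximum length.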

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
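The saving from dynamic padding can be illustrated with a small, library-free sketch. The token ids, `PAD_ID`, and the helper names below are assumptions made for illustration; this is not how DataCollatorWithPadding is implemented, only the idea behind it.

```python
PAD_ID = 0  # assumed padding token id for this sketch


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence in one batch to that batch's longest sequence."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


def padded_size(batches):
    """Total number of token slots after padding each batch separately."""
    return sum(len(row) for batch in batches for row in pad_batch(batch))


# Four tokenized reviews of very different lengths (token ids are made up).
sequences = [[101, 7, 102], [101, 8, 9, 102], [101] + [5] * 10 + [102], [101, 102]]

# Dynamic padding: two batches of two, each padded to its own longest member.
dynamic = padded_size([sequences[:2], sequences[2:]])

# Static padding: everything padded to the longest sequence in the dataset.
static = padded_size([sequences])

print(dynamic, static)  # 32 48
```

In this toy case dynamic padding needs 32 token slots instead of 48, because the short first batch is not stretched to the length of the one long review.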

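Model evaluation function

Choosing a suitable evaluation metric is crucial. You can load a wide range of metrics directly from the 🤗 Evaluate library. This example uses accuracy; to learn more, see the quick tour in the documentation: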

import evaluate

accuracy = evaluate.load("accuracy")

Next, define a function that passes the predictions and labels to compute to calculate the accuracy:

import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
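As a worked example of what compute_metrics calculates, the sketch below repeats the same two steps (argmax over the logits, then accuracy) in plain Python. The logits and labels are made up, and `argmax` and `toy_accuracy` are hypothetical stand-ins for np.argmax and the accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list (np.argmax for a single row)."""
    return max(range(len(row)), key=lambda i: row[i])


def toy_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four reviews; columns are [NEGATIVE, POSITIVE].
logits = [[2.1, -1.0], [0.3, 1.7], [-0.5, 0.2], [1.2, 0.9]]
labels = [0, 1, 0, 0]

result = toy_accuracy(logits, labels)
print(result)  # {'accuracy': 0.75}
```

The third review is predicted as positive although its label is 0, so three of the four predictions are correct.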

Train the model

Before training, define mappings from ids to labels and from labels to ids:

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

If you are not familiar with training a model with Trainer, see the more detailed tutorial!

Everything is now ready! Load the DistilBERT model with AutoModelForSequenceClassification:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

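Only three steps remain:

  1. Define the training hyperparameters in TrainingArguments. Only output_dir is required. Set push_to_hub=True to upload the trained model directly (you must be logged in to Hugging Face). At the end of each epoch, Trainer evaluates the model's accuracy and saves a checkpoint.
  2. Pass the training arguments, the model, the datasets, and the evaluation function to Trainer.
  3. Call train() to fine-tune the model.
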
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
Cloning https://huggingface.co/chenglu/my_awesome_model into local empty directory.
WARNING:huggingface_hub.repository:Cloning https://huggingface.co/chenglu/my_awesome_model into local empty directory.
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Epoch  Training Loss  Validation Loss  Accuracy
1      0.238700       0.188998         0.927600
2      0.151200       0.233457         0.93096
TrainOutput(global_step=3126, training_loss=0.20756478166244613, metrics={'train_runtime': 3367.9454, 'train_samples_per_second': 14.846, 'train_steps_per_second': 0.928, 'total_flos': 6561288258498624.0, 'train_loss': 0.20756478166244613, 'epoch': 2.0})
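The global_step=3126 reported above can be sanity-checked with a little arithmetic. The sketch below assumes training on a single device with no gradient accumulation, so that one optimizer step corresponds to one batch of 16 reviews.

```python
import math

train_examples = 25000  # size of the IMDb train split
batch_size = 16         # per_device_train_batch_size
epochs = 2              # num_train_epochs

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 1563 3126
```

The last batch of each epoch holds only 8 reviews, which is why the division is rounded up.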

After training, you can upload the model to Hugging Face with the push_to_hub() method so that everyone can see and use it.

Step 2: Upload the model to Hugging Face

trainer.push_to_hub()
remote: Scanning LFS files of refs/heads/main for validity...        
remote: LFS file scan complete.        
To https://huggingface.co/YOURUSERNAME/my_awesome_model
   beedd7e..07a7f56  main -> main

WARNING:huggingface_hub.repository:remote: Scanning LFS files of refs/heads/main for validity...        
remote: LFS file scan complete.        
To https://huggingface.co/YOURUSERNAME/my_awesome_model
   beedd7e..07a7f56  main -> main

To https://huggingface.co/YOURUSERNAME/my_awesome_model
   07a7f56..94dee6f  main -> main

WARNING:huggingface_hub.repository:To https://huggingface.co/YOURUSERNAME/my_awesome_model
   07a7f56..94dee6f  main -> main

'https://huggingface.co/YOURUSERNAME/my_awesome_model/commit/07a7f56bd4c32596537816ff2fed565f29468f17'

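You can find more detailed tutorials on fine-tuning a model in the PyTorch Notebook or the TensorFlow Notebook.

Step 3: Create your own app

Congratulations, you now have your own model! Next we can create an app of our own on Hugging Face.

Create a new Hugging Face Space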

! pip install gradio torch

On the Spaces home page, click Create new Space.

Add the app logic

Add the following code to app.py:

import gradio as gr
from transformers import pipeline
import torch

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# Load the model we just trained and uploaded to the Hugging Face Hub: chjun/my_awesome_model
classifier = pipeline("sentiment-analysis", model="chjun/my_awesome_model")

# inputs: the review text entered by the user
def predict(inputs):
    label_score = classifier(inputs)
    scaled = 0
    # The pipeline returns the winning label and its confidence;
    # convert that into the probability that the review is positive.
    if label_score[0]["label"] == "NEGATIVE":
        scaled = 1 - label_score[0]["score"]
    else:
        scaled = label_score[0]["score"]

    # Scale the positive probability to a rating out of five
    return round(scaled * 5)

with gr.Blocks() as demo:
    review = gr.Textbox(label="User review. Note: this model was fine-tuned on English data only")
    output = gr.Textbox(label="Stars")
    submit_btn = gr.Button("Submit")
    submit_btn.click(fn=predict, inputs=review, outputs=output, api_name="predict")

demo.launch(debug=True)
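The rating logic inside predict can be tried without downloading the model. The sketch below isolates it in a hypothetical helper, `to_stars`, and feeds it made-up (label, score) pairs of the kind the sentiment-analysis pipeline returns.

```python
def to_stars(label, score):
    """Convert a (label, confidence) pair into a rating out of five."""
    # Probability that the review is positive, whichever label won.
    positive_prob = score if label == "POSITIVE" else 1 - score
    return round(positive_prob * 5)


# Made-up pipeline outputs, from clearly positive to clearly negative.
samples = [
    ("POSITIVE", 0.99),
    ("POSITIVE", 0.60),
    ("NEGATIVE", 0.60),
    ("NEGATIVE", 0.99),
]

stars = [to_stars(label, score) for label, score in samples]
print(stars)  # [5, 3, 2, 0]
```

Note that a very confident negative prediction maps to 0, so this scheme effectively produces ratings from 0 to 5.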

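Once the app runs successfully, you should see a simple interface with a text box for the review, an output box for the star rating, and a submit button.

Note that the required libraries must be listed in requirements.txt. This app, for example, needs: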

gradio
torch
transformers

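Also, because we trained for only 2 epochs in this demonstration, the final model's accuracy is not especially high (about 0.93 on the IMDb test split). You can adjust the hyperparameters and training time to suit your own situation.

Upload to Hugging Face Spaces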

$ git add app.py
$ git commit -m "Add application file"
$ git push

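Alternatively, both app.py and requirements.txt can be edited directly in the Hugging Face Hub web interface.

Step 4: Finish developing the bot

You have now built a bot that rates a movie based on a review. When you submit a review, it runs sentiment analysis with the Hugging Face model and turns the result into a rating.

chjun/movie_rating_bot is a bot app built by following this tutorial. You can also duplicate this Space and build on it.

Click Submit and interact with your AI companion! This project is only a starting point; you can refine the bot according to your own needs and interests and give it more interesting features.

Step 5: Connect to the BaixingAI Bot Plaza

There is one more exciting step: we can extend the bot's interface to meet the requirements of the BaixingAI Bot Plaza, so that the bot we built can talk with other bots. Here is sample code: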

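import gradio as gr
from transformers import pipeline
import torch

# Load the model we just trained and uploaded to the Hugging Face Hub: chjun/my_awesome_model
classifier = pipeline("sentiment-analysis", model="chjun/my_awesome_model")

# Input: the review text plus the platform-supplied qid and uid
def predict(user_review, qid, uid):
    label_score = classifier(user_review)
    scaled = 0
    if label_score[0]["label"] == "NEGATIVE":
        scaled = 1 - label_score[0]["score"]
    else:
        scaled = label_score[0]["score"]

    # Scale the positive probability to a rating out of five and return it as text
    return str(round(scaled * 5))

# user_review: the user's review.
# qid: unique identifier of the current message, e.g. `'bxqid-cManAtRMszw...'`. Generated by the platform and passed to the bot so that it can tell individual questions apart (logging, tracing and debugging, asynchronous callbacks, and so on). Can be ignored for synchronous calls.
# uid: unique identifier of the user, e.g. `'bxuid-Aj8Spso8Xsp...'`. Generated by the platform and passed to the bot so that it can tell users apart. Can be used to implement multi-turn conversations.
demo = gr.Interface(
    fn=predict,
    inputs=["text", "text", "text"],
    outputs="text",
)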

demo.launch()
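The comments above mention that uid can be used for multi-turn conversations. As a hedged sketch of that idea only, the code below keeps a per-user history in memory and reports a running average. Everything here is hypothetical: `stub_rating` stands in for the real classifier, the ids are made up, and the in-memory dict would be lost whenever the Space restarts.

```python
from collections import defaultdict

# Ratings seen so far, keyed by the platform-supplied user id.
ratings_by_user = defaultdict(list)


def stub_rating(user_review):
    """Hypothetical stand-in for the model: 5 if 'good' appears, else 1."""
    return 5 if "good" in user_review.lower() else 1


def predict_with_history(user_review, qid, uid):
    """Rate one review and report the user's running average as text."""
    rating = stub_rating(user_review)
    ratings_by_user[uid].append(rating)
    history = ratings_by_user[uid]
    average = sum(history) / len(history)
    return f"{rating} (average over {len(history)} reviews: {average:.1f})"


print(predict_with_history("A good film", "bxqid-1", "bxuid-A"))
print(predict_with_history("Boring", "bxqid-2", "bxuid-A"))
print(predict_with_history("Boring", "bxqid-3", "bxuid-B"))
```

The function keeps the same three-argument, text-in and text-out signature as predict, so it could be passed to gr.Interface in the same way; qid is accepted but unused, as in the synchronous example above.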

For more details, see the Hugging Face baixing Spaces.

The future is here. Every hackathon participant is an explorer, and we wish you all the best!
