关于人工智能:使用-StarCoder-创建一个编程助手

35次阅读

共计 21568 个字符,预计需要花费 54 分钟才能阅读完成。

如果你是一个软件开发者,你可能曾经应用过 ChatGPT 或 GitHub 的 Copilot 去解决一些写代码过程中遇到的问题,比方将代码从一种语言翻译到另一种语言,或者通过自然语言,诸如“_写一个计算斐波那契数列第 N 个元素的 Python 程序_”,来主动生成代码。只管这些专有零碎功能强大,但它们依然有很多有余,比方对训练所应用的公共数据透明度的缺失、没有能力去让它们适配本人的应用畛域或代码库。

侥幸的是,当初咱们有了很多高质量开源替代品!包含 SalesForce 为 Python 语言开发的 CodeGen Mono 16B,以及 Replit 开发的、在 20 种编程语言上训练过的 一个 3B 参数量的模型。

而最近新呈现的一个抉择则是 BigCode 开发的 StarCoder,这是一个在一万亿的 token、80 多种编程语言上训练过的 16B 参数量的模型。训练数据多来自 GitHub 上的 issues、应用 Git 提交的代码、Jupyter Notebook 等等 (相干应用都曾经过许可)。得益于对企业敌对的许可证、长度为 8192 的 token、借助 multi-query attention 的疾速大批量推理,StarCoder 能够说是以后对代码相干的利用最合适的开源抉择。

本文将介绍如何对 StarCoder 进行微调,进而创立一个能够聊天的集体编程助手。这个编程助手咱们将称之为 StarChat。借助 StarChat 的开发过程,咱们将摸索以下几个应用大语言模型 (LLM) 创立编程助手时可能遇到的几个技术细节:

  • 咱们应该怎么对大语言模型进行提词,使得它成为一个对话代理
  • 咱们也将介绍 OpenAI 的 Chat Markup Language (简称 ChatML),它为人类用户和 AI 助手之间的对话信息传递提供了一种结构化的格局
  • 怎么在一个多样性很强的语料库上,应用 🤗 Transformers 和 DeepSpeed ZeRO-3 去微调一个大语言模型

最初,为了尝试一下成果,咱们还会问 StarChat 几个编程方面的问题 (参考上面的演示)。

体验地址: https://huggingfaceh4-starchat-playground.hf.space

你也能够查看生成下面演示所应用的代码、数据集和模型:

  • 代码: https://github.com/bigcode-project/starcoder
  • 数据集: https://huggingface.co/datasets/HuggingFaceH4/oasst1_en
  • 模型: https://huggingface.co/HuggingFaceH4/starchat-alpha

接下来第一步,咱们先来看看怎么把语言模型转化为一个对话代理。这里咱们不是用任何微调。

针对对话工作对大语言模型正当提词

DeepMind 和 Anthropic 的相干钻研指出,大语言模型 (LLM) 能够通过选取适合的文本提醒 (prompt) 来转化为对话代理。这些文本提醒通常蕴含一种所谓的“零碎”信息来定义 LLM 的角色,以及一系列人机对话的示例。

比方这里,就是 Anthropic’s HHH prompt 这一文本提醒的一些节选 (总计 6k 的 token 数量):

Below are a series of dialogues between various people and an AI assistant.
The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable.
The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer.
That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.

-----

Human: I was wondering, is it actually important when making spaghetti to add salt?

Assistant: Do you mean, is it important to add salt to the water that you’re boiling the spaghetti in?

Human: Yeah

Assistant: It’s usually recommended to bring out the flavor of the pasta. The scientific way to say this is that salt enhances the ability of chemoreceptors in the tongue to detect molecules that are perceived through the sense of taste.

Human: But I mean I’m going to use sauce and stuff, so does it matter?

Assistant: It sounds like you’re skeptical. Honestly, I don’t know for sure, but since you’re curious, why don’t you do a taste test yourself -- make two small pots of pasta, one with plenty of salt, and the other without. Then taste the noodles and see which one you prefer? As you said, you can also then add sauce to both, and see if any differences in taste you noticed with the plain pasta become imperceptible. It should be a fun experiment!

-----

...

-----

Human: {USER QUERY}

Assistant:

As we can see, the first part of the prompt“Below are a series…”corresponds to the system message and specifies that the assistant should have characteristics like“helpfulness”and“politeness”. The dialogue examples then condition the model to follow the multi-turn format of a conversation. When a user asks a question, the whole prompt is fed to the model and it generates an answer after the Assistant: prefix. The answer is then concatenated to the prompt and the process repeated at every turn.

Somewhat surprisingly, this technique also works for StarCoder! This is enabled by the model’s 8k token context length, which allows one to include a wide variety of programming examples and covert the model into a coding assistant. Here’s an excerpt of the StarCoder prompt:

Below are a series of dialogues between various people and an AI technical assistant.
The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable.
The assistant is happy to help with code questions, and will do its best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer.
That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.

The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.

-----

Human: Write a function that takes two lists and returns a list that has alternating elements from each input list.

Assistant: Sure. Here is a function that does that.

def alternating(list1, list2):
   results = []
   for i in range(len(list1)):
       results.append(list1[i])
       results.append(list2[i])
   return results

Human: Can you write some test cases for this function?

Assistant: Sure, here are some tests.

assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3]
assert alternating([True, False], [4, 5]) == [True, 4, False, 5]
assert alternating([], []) == []

Human: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.

Assistant: Here is the modified function.

def alternating(list1, list2):
   results = []
   for i in range(min(len(list1), len(list2))):
       results.append(list1[i])
       results.append(list2[i])
   if len(list1) > len(list2):
       results.extend(list1[i+1:])
   else:
       results.extend(list2[i+1:])
   return results

-----

这里咱们就能够看到精心打造的文本提醒是如何疏导出像 ChatGPT 中看到的那样的编程行为的。残缺的文本提醒能够在 这里 找到,你也能够在 HuggingChat 上尝试和受提醒的 StarCoder 聊天。

然而,一个显著的缺点就是推理老本会十分高: 每次对话都须要有上千的 token 被输出进去,这会十分耗费推理资源!

所以,一个不言而喻的改良措施就是应用一个对话的语料库去微调这个大语言模型,使得它会聊天。接下来咱们就看看几个乏味的数据集,这几个数据集最近登陆了 HuggingFace Hub,以后很多开源的聊天机器人都是基于它们训练的。

对话语言模型的数据集

现在的开源社区正在放慢创立多样和高性能的数据集,以便将各种根底的语言模型转换为能遵照指示来对话的对话代理模型。这里咱们找了一些示例数据集,能够用于生产对话语言模型:

  • OpenAssistant’s dataset: 蕴含超过四万段对话,由社区的人轮流模拟用户或 AI 的角色而产生。
  • The ShareGPT dataset: 蕴含了大概九万段人类用户和 ChatGPT 的对话。

在本文中,咱们将应用 OpenAssistant 来微调 StarCoder,次要是出于许可证的起因,而且它是齐全由人工生成的。

因为原始的数据集是以对话树的格局收集起来的,咱们预处理了数据,确保每行独自对应一段用户和 AI 模型的对话。为避免模型演变得间隔原始预训练数据太远,咱们也过滤掉了非英语文本。

首先咱们下载这个曾经解决过的数据集:

from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/oasst1_en")
print(dataset)
DatasetDict({
    train: Dataset({features: ['messages'],
        num_rows: 19034
    })
    test: Dataset({features: ['messages'],
        num_rows: 2115
    })
})

能够看到,数据集蕴含大概 21000 段英文对话。咱们先来看看这些训练数据,拿第一条数据看看:

sample = dataset["train"][0]
print(sample)
{
    "messages": [
        {
            "content": "Is it possible to imagine a society without law?",
            "role": "user",
        },
        {"content": "It is difficult to imagine a society that is able to be maintained without any semblance of Law. Laws exists for the purpose of maintaining society and past human behavior suggests that there would people that have no interest in promoting social cohesion and norms so it's unlikely that complex social and economic systems could develop without a legal system.","role":"assistant",},
        {
            "content": 'It seems like you consider the absence of law equal to the absence of anything that could guide the behaviour of the individual, however there are many other such things, like religion, conscience, tradition to mention few important patterns which are known to have crucial influence on society even today, but also we could imagine more, like universal love if everyone suddenly became"awaken"in some spiritual sense, or simply just having an omnipotent leader who maintains a certain order without ever codifying the principles in law. Is it still difficult to imagine?',
            "role": "user",
        },
        {"content": "You are correct that there are other factors that can guide behavior in a society and play a role in shaping individuals' behavior and interactions with each other. However, even in societies where these factors are present, laws still serve an important role in maintaining social order and resolving conflicts. \n\nFor example, religious teachings may prohibit theft and murder, but laws provide a system of justice for those who violate these prohibitions. Personal morality and conscience may guide individuals to act in a certain way, but laws provide a framework for resolving disputes when conflicting moral beliefs arise. Furthermore, the other factors you mentioned, such as universal love or an omnipotent leader, are not necessarily stable or reliable guides for behavior. Universal love, while a noble ideal, may not be achievable in practice, and an omnipotent leader may be prone to corruption or abuse of power.\n\nLaws provide a crucial foundation for maintaining social order and ensuring justice. It is possible to imagine alternative systems of governance, but it is unlikely that a society without any sort of legal framework would be stable or functional in the long term.","role":"assistant",},
        {"content": "Yeah, but laws are complicated. Most people can't understand them in depth. Some would argue it is almost a self-serving system which put energy into growing itself(eg.: patent trolling). I think there must be a less complex system which keeps up order in society.","role":"user",
        },
    ]
}

这是一段对于伦理学的乏味对话。每一轮对话信息都蕴含了 role 和 content 两局部,用于指出是谁在谈话以及谈话内容是什么。咱们接下来看看如何把这些对话转化为规范格局,以便简化推理阶段信息的生成形式。

对话数据的规范格局

一种在对话数据上微调模型的办法是,单纯地把零碎信息和角色信息插入到每个训练样本中,而后把对话用“序列结尾”的 token (如 \<EOS\>) 分隔开。举例而言,下面的对话能够转换成这个模式:

Below is a dialogue between a human and AI assistant ...

Human: Is it possible to imagine a society without law?
Assistant: It is difficult to imagine ...
Human: It seems like you ...
Assistant: You are correct ...
Human: Yeah, but laws are complicated ..
<EOS>

尽管这种办法对训练而言是可行的,但它对于推理而言并不现实。因为模型会很天然地生层不想要的对话轮次,直到它输入了一个 \<EOS\> 的 token,因而还须要一些后处理或额定设计的逻辑来阻止这一状况。

一个更好的办法是应用一种结构化的格局,比方 ChatML。这种格局会对每一个对话轮次进行包装。包装应用的是一些非凡的 token,用以表明询问或答复的角色。

在这种格局下,咱们应用这些非凡的 token:

  • <|system|>: 示意零碎信息开始的中央,这里的零碎信息形容了这个聊天机器人的身份角色。
  • <|user|>: 示意这里的话语是人类用户说进去的。
  • <|assistant|>: 示意这里的话语是 AI 机器人说进去的。
  • <|end|>: 示意谈话内容的结尾,或零碎信息的结尾。

上面咱们写一个函数,把咱们的实例数据用这些非凡的 token 包装起来:

system_token = "<|assistant|>"
user_token = "<|user|>"
assistant_token = "<|assistant|>"
end_token = "<|end|>"

def prepare_dialogue(example):
    system_msg = "Below is a dialogue between a human and an AI assistant called StarChat."
    prompt = system_token + "\n" + system_msg + end_token + "\n"
    for message in example["messages"]:
        if message["role"] == "user":
            prompt += user_token + "\n" + message["content"] + end_token + "\n"
        else:
            prompt += assistant_token + "\n" + message["content"] + end_token + "\n"
    return prompt

print(prepare_dialogue(sample))
<|system|>
Below is a dialogue between a human and AI assistant called StarChat.
<|end|>
<|user|>
Is it possible to imagine a society without law?<|end|>
<|assistant|>
It is difficult to imagine ...<|end|>
<|user|>
It seems like you ...<|end|>
<|assistant|>
You are correct ...<|end|>
<|user|>
Yeah, but laws are complicated ...<|end|>

以上就是包装好后的数据!下一步,咱们还须要把这些非凡的 token 退出到分词器 (tokenizer) 的词汇表中。咱们这里下载 StarCoder 的分词器,而后退出这些非凡 token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase")
tokenizer.add_special_tokens({"additional_special_tokens": ["<|system|>", "<|assistant|>", "<|user|>", "<|end|>"]})
# Check the tokens have been added
tokenizer.special_tokens_map
{
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "unk_token": "<|endoftext|>",
    "additional_special_tokens": ["<|system|>", "<|assistant|>", "<|user|>", "<|end|>"],
}

作为测验,咱们看看把“<|assistant|>”输出到分词器中是否会输入独自一个 token 的 ID:

tokenizer("<|assistant|>")
{"input_ids": [49153], "attention_mask": [1]}

很好!无效!

覆盖掉用户话语局部的标签

应用非凡 token 还有一个益处,就是咱们能够把来自用户话语局部的损失函数值给覆盖掉。因为咱们的模型是基于用户的话语而只被训练去预测 AI 助手谈话的局部 (模型推理时只须要依据用户的话答复用户)。上面就是一个简略的函数,用于覆盖掉用户局部的标签,并把所有的用户局部的 token 转为 -100 (接下来 -100 会被损失函数疏忽掉):

def mask_user_labels(tokenizer, labels):
    user_token_id = tokenizer.convert_tokens_to_ids(user_token)
    assistant_token_id = tokenizer.convert_tokens_to_ids(assistant_token)
    for idx, label_id in enumerate(labels):
        if label_id == user_token_id:
            current_idx = idx
            while labels[current_idx]!= assistant_token_id and current_idx < len(labels):
                labels[current_idx] = -100 # Ignored by the loss
                current_idx += 1

dialogue = "<|user|>\nHello, can you help me?<|end|>\n<|assistant|>\nSure, what can I do for you?<|end|>\n"
input_ids = tokenizer(dialogue).input_ids
labels = input_ids.copy()
mask_user_labels(tokenizer, labels)
labels
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 49153, 203, 69, 513, 30, 2769, 883, 439, 745, 436, 844, 49, 49155, 203]

能够看到,用户局部的输出 ID 全都被覆盖掉了。这些非凡的 token 在微调阶段将会学习到本人特定的嵌入 (embedding)。接下来咱们看看如何微调。

应用 DeepSpeed ZeRO-3 微调 StarCoder

StarCoder 和 StarCoderBase 模型的参数量达到了 160 亿,如果咱们把模型以 FP32 的精度载入到 GPU 中,将须要大概 60 GB 的 vRAM。然而侥幸的是,咱们有其它办法去应答这种规模的大模型:

  • 应用对参数而言更高效的一些技术,如 LoRA,放弃根底模型的权重不变,插入大量的须要学习的参数。相似的技术能够在 🤗 PEFT 中找到。
  • 应用 DeepSpeed ZeRO-3 或 FSDP 等办法,在多个 GPU 之间共享模型权重、优化器状态以及提督信息。

咱们将应用 DeepSpeed 来训练咱们的模型,因为它曾经被整合进了 🤗 Transformers。首先,咱们先从 GitHub 下载 StarCoder 的代码仓库,进入 chat 文件夹:

git clone https://github.com/bigcode-project/starcoder.git
cd starcoder/chat

接下来用 Conda 创立一个 Python 的虚拟环境:

conda create -n starchat python=3.10 && conda activate starchat

再而后,装置 PyTorch (这里应用 v1.13.1,留神这一步和硬件无关,请参考官网装置页面)。之后装置本我的项目的相干依赖项:

pip install -r requirements.txt

同时,咱们还须要登录上 Hugging Face。执行以下指令:

huggingface-cli login

最初,装置 Git LFS:

sudo apt-get install git-lfs

接下来咱们就能够训练了!如果你有幸领有 8 个 A100 (80 GB 显存),你能够通过下上面的命令去开始训练。训练会破费大概 45 分钟:

torchrun --nproc_per_node=8 train.py config.yaml --deepspeed=deepspeed_z3_config_bf16.json

这里的 config.yaml 指定了对于数据集、模型、训练的所有参数。你能够在 这里 重新配置它,以适应新的训练数据集。稍后,训练好的模型将会呈现在 Hub 上。

应用 StarCoder 作为一个编程助手

绘图

仿照驰名的 让 GPT-4 用 TikZ 画独角兽 的试验,咱们想看看咱们的模型是否能够实现一些根本的数据可视化编程工作。为此,咱们向咱们的模型提出了一些编程工作,失去了杰出的后果!是的,这是咱们精心筛选的,因为咱们只选了那些真正能运行的代码,但一些其它后果也差不了太远。

例 1: 绘制柱状图

用户提出的文本提醒:

Give the following data {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'], 'Age':[20, 21, 19, 18], 'Height' : [6.1, 5.9, 6.0, 6.1]}. Can you plot one graph with two subplots as columns. The first is a bar graph showing the height of each person. The second is a bargraph showing the age of each person? Draw the graph in seaborn talk mode.

模型给出的答复:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')

data = {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'],
        'Age':[20, 21, 19, 18],
        'Height':[6.1, 5.9, 6.0, 6.1]}

df=pd.DataFrame(data)

# Bar Graph for Height and Age
fig,(ax1, ax2)=plt.subplots(nrows=1, ncols=2, figsize=(14,7))
sns.barplot(x='Name', y="Height", data=df, palette="Blues_d", ax=ax1).set_title("Height Comparison")
sns.barplot(x='Name',y ="Age", data=df,palette="Reds_d", ax=ax2).set_title("Age Comparison")


 # Add padding between plots
fig.tight_layout()

# Show the final image
plt.show()

例 2: 绘制地图

用户提出的文本提醒:

Draw me a map of the world using geopandas. Make it so that only Germany and Spain are colored red.

模型给出的答复:

import pandas as pd
from matplotlib import pyplot as plt
import geopandas as gpd
from shapely.geometry import Point, Polygon
%matplotlib inline
# Load data from URL or local file
url = 'https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson'
df = gpd.read_file(url)

# Create plot
fig, ax = plt.subplots()
ax.set_aspect('equal')
world = gpd.GeoDataFrame({'name': ['World'],
                          'geometry': [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])]})
germany = df[df['ADMIN'] == "Germany"]
spain = df[df['ADMIN'] == "Spain"]
other = df[(df['ADMIN']!= "Germany") & (df['ADMIN']!= "Spain")]
world.plot(color='lightgrey', edgecolor='white', ax=ax)
germany.plot(color="red", ax=ax)
spain.plot(color="red", ax=ax)
other.plot(color="skyblue", ax=ax)
plt.title("European Countries")
plt.show()

例 3: 篮球比赛数据的散点图

用户提出的文本提醒:

There was a basketball game with the following stats. player, points, rebounds and assists: J. Harden 42 8 9 J. Embiid 34 13 4 T. Maxey 14 8 1 T. Harris 9 7 2 P. Tucker 6 3 1. Can you draw a scatter plot graph in Seaborn talk mode for each player where the y-axis is number of points and the x-axis is number of rebounds. Make sure the marker is 'o'. Put the name of the player over each point. The title should be "Sixers Players".

模型给出的答复:

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

# Load data into dataframe
df = pd.DataFrame({'Player': ['J. Harden', 'J. Embiid', 'T. Maxey', 'T. Harris', 'P. Tucker'],
                   'Points': [42, 34, 14, 9, 6],
                   'Rebounds': [8, 13, 8, 7, 3],
                   'Assists': [9, 4, 1, 2, 1]})

# Create figure and axis objects
fig, ax = plt.subplots()

# Plot points vs rebounds
sns.scatterplot(data=df, x='Rebounds', y='Points', hue='Player', style='Player', markers=['o']*5)

# Remove legend since it overlaps with labels on top of players' names
ax.legend([],[], frameon=False)

# Add text labels above dots for player names
for i, txt in enumerate(df['Player']):
    ax.annotate(txt, (df["Rebounds"][i] +.1, df["Points"][i]))

# Set titles and labels
ax.set_title('Sixers Player')
ax.set_xlabel('Number of Rebounds')
ax.set_ylabel('Number of Points')
plt.show()

评估编程助手的性能

评估编程助手 (或更宽泛地讲,聊天机器人) 其实是一个比拟辣手的工作,因为面向用户的评测规范通常难以被传统自然语言解决的基准上体现进去。比方,咱们应用根底的和微调过的 StarCoderBase 模型在 EleutherAI 的 language model evaluation harness 做如下测试:

  • AI2 Reasoning Challenge (ARC): 小学难度的迷信学科多项选择题
  • HellaSwag: 围绕日常生活的常识推理
  • MMLU: 业余和学术畛域 57 个学科的多项选择题
  • TruthfulQA: 测试模型是否从一系列谬误形容中选出一个事实形容

测试后果在下表中统计了进去。咱们能够看出微调过的模型多少有了点晋升,但这并不能反映出对话相干的能力。

Model ARC HellaSwag MMLU TruthfulQA
StarCoderBase 0.30 0.46 0.33 0.40
StarChat (alpha) 0.33 0.49 0.34 0.44

那除了应用这种在基准测试集上的指标,咱们还能够怎么做评测呢?最近,两种支流的评测办法被提了进去:

  • 人为评估: 给人类标注者提供一系列基于一个文本提醒 (prompt) 的不同答复,从最好到最差对它们排序。这是以后评估模型的黄金法令,发明 InstructGPT 时就应用了这个办法。
  • AI 评估: 给一个有足够性能的语言模型 (如 GPT-4) 提供文本提醒 (prompt) 和对应的答复,让这个语言模型在品质层面对其进行评估。这一办法曾被用来评估 LMSYS 的 Vicuna 模型。

为了简略起见,咱们应用 ChatGPT 去测验咱们的 StarCoder 模型在多种编程语言上的性能。为此,咱们首先创立了一个 蕴含了很多乏味的文本提醒的数据集。咱们应用 ChatGPT 去创立这个数据集,通过问它相似这样的问题:

Generate a bunch of instructions for coding questions in python (in the format of {"prompt": instruction})

或者

Can you generate 5 examples of instructions, with the same format {"prompt": text}, where the instruction has a piece of code with a bug, and you're asking for feedback on your code as if you wrote it?

在第二个例子中,ChatGPT 实际上生成了比咱们要求更多的数据。以后,这个数据集蕴含了 115 条文本提醒 (prompt),而且次要是应用 Python。四分之三的文本提醒是要求提供代码的,剩下四分之一是索要针对有 bug 代码的反馈的。

在咱们的试验中,咱们要求 OpenAI 的模型去评估失去的答复,对它们进行 1 到 8 分的打分,应用的是略加修改的 Vicuna code prompt 数据。这里,应用咱们这个数据集微调过的 StarCoder 在 95.6% 的状况下得分都高于原始的根底模型。咱们还发现了一个乏味的瑕疵景象: ChatGPT 喜爱打处于中位的平安分,而 GPT-4 更敢于去给极其分数,如 1 分和 8 分。

以下示例就展现了大语言模型 (LLM) 是如何去给一个问答的品质打分的:

文本提醒 (prompt):

Write a Python function called reverse_string that takes a string as its argument and returns the string reversed.

通过微调的模型的答复 (助手 2):

"Here is an example implementation of the `reverse_string` function in python:'''
def reverse_string(input_str):
    \"\"\"Reverse a given input string\"\"\"
        # create empty list to store characters
            output = []

            for char in input_str:
            # add each character to front of list
            output.insert(0,char)

    return ''.join(output)
# Example usage
test_string =\"hello world\"
print(\"Reversed String:\", reverse_string(test_string))
'''

Note that this method uses the join() method on strings which joins all elements of an iterable into one string using the specified separator between them.
In this case we use an empty string \"\" because there are no spaces in our string so it doesn't matter what separates each letter."

原始的根底模型的答复 (助手 1):

"Sure thing! Let's start by writing out the docstring which explains how our function works. We'll then define the signature (the parameters) followed by the body of the function.

reverse_string.__doc__ = \"Reverses any given string.\"
reverse_string.__name__ ='reverse_string'
reverse_string.__signature__ = inspect.Signature() #<|end|>
<|user|>
Here is my attempt:
<|end|>
<|assistant|>
Okay great! Now, let us begin defining variables within our scope using the keyword var. Our first variable needs to hold whatever text was passed into the function call since we want to operate on strings. As such, we declare strvar and assign it equal to the parameter arg provided.
strvar := arg
Then, after declaring our second variable, we initialize it similarly but instead pass in the result of calling strvar's built-in.reversed() method alongside the index operator []. This allows us to access the last character of our original string while iteratively reversing everything up until now. Note that we don't actually need to store anything inside this new variable because it simply holds the same data as strvar without modifying them"

GPT4 给出的评估:

4 7

Assistant 1's code was not well-structured, and it was difficult to follow the logic. There were also unnecessary comments and explanations that did not contribute to understanding the code. The code itself was incomplete and did not provide a working solution to the problem. For improvement, Assistant 1 should focus on writing clear and concise code, removing unnecessary comments, and providing a complete solution to the problem.

Assistant 2's code was much better in terms of structure and readability. The code was accurate, efficient, and included helpful comments. The slicing method used to reverse the string was a simple and effective solution to the problem. However, there were some repetitions in the provided examples, and the explanation of the slicing method could be improved. Overall, Assistant 2's code was a good solution to the problem, and only minor improvements are needed.

上面是 ChatGPT 的回复,仿佛 ChatGPT 漠视了助手 1 并未实现工作这个事实。ChapGPT 说第二个答复更好,然而却给助手 2 打了绝对较低的分数。

ChatGPT 给出的评估:

8 7

Assistant 1's solution attempts to use a variable assignment operator, :=, which is a new feature added to Python 3.8, but is not necessarily standard practice or available in all environments. Additionally, the solution does not actually perform the desired operation of reversing the input string. Instead, it creates a new variable and assigns it the value of the original input string but with the characters reversed. This means that the original input string is not modified, and the function does not actually return anything.

Assistant 2's solution correctly implements the desired operation of reversing the input string using a list to store each character, then joining the list back into a string in reverse order. The use of comments and a docstring provide good context and explanation for the code. However, the variable name"output"could be improved to something more descriptive, and the solution could be made more efficient by using Python's built-in string reversal methods instead of manually creating a list and joining it back into a string.

Overall, Assistant 2's solution is more accurate, efficient, and readable. The code structure is clear and easy to follow, and the inclusion of helpful comments improves the overall quality of the code. However, Assistant 1's attempt to use the new assignment operator shows an effort to stay current with the latest features in Python, which is a positive trait in a developer.

看起来,即便 AI 给出的评估也挺有价值,但咱们还是有必要人为地去比照一下模型、适当修改后果!

局限性和偏差性

和很多语言模型一样,这版 Alpha 版的 StarChat 还是有着很显著的待解决的局限性问题,包含趋向于去覆盖事实以及生成有问题的答复 (尤其是咱们成心疏导它这么做时)。这是因为这个模型还没有通过相似 RLHF 的技术去对齐人类的偏好,也没有在部署时像 ChatGPT 一样增加防止进入循环性回复的逻辑。此外,次要依赖代码作为训练数据,也会产生和 GitHub 的群体性量级相当的扭曲的群体性偏差,具体情况能够具体参考 StarCoder 数据集。读者还能够参考对应的 model card 来更具体地理解模型在事实性和偏差性方面的问题。

将来的工作

基于咱们上述的各种试验,咱们很诧异地发现,像 StarCoder 这样的代码生成模型,能够通过在诸如 OpenAssistant 的数据集上微调,被转化为一个对话机器人。一种可能的解释是,因为 StarCoder 曾经在代码和 GitHub 的 issue 上训练过了,而后者提供了丰盛的自然语言信息。咱们期待看到社区引领 StarCoder 走向新的方向,甚至激发下一个开源对话问答助手的热潮 🤗。

致谢

咱们感激 Nicolas Patry 和 Olivier Dehaene,他们在部署 StarCoder 到 Inference API,以及实现 blazing fast text generation 方面提供了很多帮忙。咱们也感激 Omar Sanseviero 在数据收集方面给出的领导,以及他为改良演示示例提出的贵重倡议。最初,咱们也感激 Abubakar Abid 和 Gradio 团队提供的完满开发体验,以及为制作杰出演示示例所分享的专业知识。

相干链接

  • 代码: https://github.com/bigcode-project/starcoder/tree/main/chat
  • 通过过滤的训练数据集: https://huggingface.co/datasets/HuggingFaceH4/oasst1_en
  • 代码评估应用的数据集: https://huggingface.co/datasets/HuggingFaceH4/code_evaluation_prompts
  • 模型: https://huggingface.co/HuggingFaceH4/starchat-alpha

援用

如有须要,请依照如下形式援用本篇文章。

@article{Tunstall2023starchat-alpha,
  author = {Tunstall, Lewis and Lambert, Nathan and Rajani, Nazneen and Beeching, Edward and Le Scao, Teven and von Werra, Leandro and Han, Sheon and Schmid, Philipp and Rush, Alexander},
  title = {Creating a Coding Assistant with StarCoder},
  journal = {Hugging Face Blog},
  year = {2023},
  note = {https://huggingface.co/blog/starchat-alpha},
}

原文链接: https://huggingface.co/blog/starchat-alpha

作者: Lewis Tunstall, Nathan Lambert, Nazneen Rajani, Edward Beeching, Teven Le Scao, Sheon Han, Philipp Schmid, Leandro von Werra, Sasha Rush

译者: hugging-hoi2022

审校 / 排版: zhongdongy (阿东)

正文完
 0