This article demonstrates how to fine-tune the new Llama 2 with PEFT, QLoRA, and Hugging Face to build your own code generator. The focus is on how to customize your own Llama 2 and train it quickly for a specific task.

Some background

Compared with its predecessor, Llama 2 increases the number of training tokens by 40%, to 2T, doubles the context length, and uses grouped-query attention (GQA) to speed up inference on the heavier 70B model. On top of the standard transformer architecture it applies RMSNorm normalization, SwiGLU activation, and rotary positional embeddings; the context length reaches 4,096 tokens, and training uses the AdamW optimizer with a cosine learning-rate schedule, a weight decay of 0.1, and gradient clipping.
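As a quick illustration of one of these components, below is a minimal PyTorch sketch of RMSNorm (my own simplification, not the actual Llama 2 implementation):

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        """Minimal RMSNorm sketch: normalize by the root mean square of the
        activations instead of the mean/variance used by LayerNorm."""
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # RMS over the last (feature) dimension
            rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
            return self.weight * (x / rms)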

The supervised fine-tuning (SFT) stage is characterized by prioritizing quality samples over quantity, since many reports show that using high-quality data improves the performance of the final model.

Finally, a reinforcement learning from human feedback (RLHF) step aligns the model with user preferences. A large number of examples were collected in which humans picked their preferred model output in a comparison, and this data was used to train a reward model.
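Such a reward model is typically trained with a pairwise ranking loss over these human comparisons; a minimal sketch of that loss (my own simplification, not the exact Llama 2 training code):

    import torch
    import torch.nn.functional as F

    def pairwise_ranking_loss(reward_chosen: torch.Tensor,
                              reward_rejected: torch.Tensor) -> torch.Tensor:
        """Ranking loss: push the reward of the human-preferred response
        above the reward of the rejected one."""
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()

    # Toy usage: scores a reward model produced for a batch of comparison pairs
    chosen = torch.tensor([1.2, 0.7, 2.0])
    rejected = torch.tensor([0.3, 0.9, 1.1])
    print(pairwise_ranking_loss(chosen, rejected))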

The most important point is that, according to the paper's human evaluations, LLaMA 2-Chat is already about as good as OpenAI's ChatGPT, so we can use it as a local replacement.

The dataset

For the fine-tuning process, we will use a dataset of roughly 18,000 examples that ask the model to build Python code that solves a given task. It is an extract of the original dataset [2] in which only the Python-language examples are selected. Each row contains a description of the task to be solved, an example of data input for the task where applicable, and the generated code snippet that solves the task [3].

    from datasets import load_dataset
    from random import randrange

    # dataset_name and dataset_split are assumed to point at the hub dataset from [3]
    # Load dataset from the hub
    dataset = load_dataset(dataset_name, split=dataset_split)
    # Show dataset size
    print(f"dataset size: {len(dataset)}")
    # Show an example
    print(dataset[randrange(len(dataset))])

Creating the prompt

To perform instruction fine-tuning, we must turn each data example into an instruction, outlining its main sections as follows:

    def format_instruction(sample):
        return f"""### Instruction:
    Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:

    ### Task:
    {sample['instruction']}

    ### Input:
    {sample['input']}

    ### Response:
    {sample['output']}
    """

The output looks like this:

    ### Instruction:
    Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:

    ### Task:
    Develop a Python program that prints "Hello, World!" whenever it is run.

    ### Input:

    ### Response:
    #Python program to print "Hello World!"
    print("Hello, World!")

Fine-tuning the model

For convenience we use a Google Colab environment. A T4 instance is enough for a first test run, but training on the whole dataset requires an A100.
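A quick way to confirm which GPU the Colab runtime has assigned (a convenience check added here, not part of the original script):

    import torch

    # Print the GPU the runtime assigned, if any
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))
    else:
        print("No GPU available")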

Besides that, you can also log in to the Hugging Face Hub so you can upload and share the model; this step is optional.

    from huggingface_hub import login
    from dotenv import load_dotenv
    import os

    # Load the environment variables
    load_dotenv()
    # Login to the Hugging Face Hub
    login(token=os.getenv("HF_HUB_TOKEN"))

PEFT, LoRA and QLoRA

The usual steps for training an LLM are: first, pre-train on billions or trillions of tokens to obtain a foundation model; then fine-tune that model to specialize it for a downstream task.

Parameter-efficient fine-tuning (PEFT) lets us dramatically reduce RAM and storage requirements by fine-tuning only a small number of additional parameters while all of the model's own parameters stay frozen. PEFT also improves the reusability and portability of the model: the small checkpoints can easily be added to the base model, and the base model can be reused in multiple scenarios by adding the PEFT parameters. Finally, since the base model is never adjusted, all the knowledge acquired in the pre-training phase is preserved, avoiding catastrophic forgetting.
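To make the "small number of additional parameters" concrete, here is a hedged sketch that wraps a base model with the peft library and prints how few parameters are actually trainable (the gpt2 model id and the LoRA settings are just illustrative choices):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Any causal LM works here; gpt2 is only an illustrative choice
    base_model = AutoModelForCausalLM.from_pretrained("gpt2")

    config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                        task_type="CAUSAL_LM")
    peft_model = get_peft_model(base_model, config)

    # Prints e.g. "trainable params: ... || all params: ... || trainable%: ..."
    peft_model.print_trainable_parameters()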

PEFT keeps the pre-trained base model unchanged and adds new layers or parameters on top of it. These layers are called "adapters": we attach them to the pre-trained base model and train only the parameters of these new layers. A serious problem with this approach, however, is that the layers add latency in the inference phase, making the process inefficient in many situations.

The LoRA technique (Low-Rank Adaptation of large language models), instead of adding new layers, adds values to the parameters of the model's layers in a way that avoids this dreaded latency problem at inference time. LoRA trains and stores the changes of the additional weights while freezing all of the pre-trained model's weights. That is, we train a new weight matrix holding the changes to the pre-trained model's matrix, and decompose this new matrix into two low-rank matrices, as follows:

The LoRA authors [4] propose that the weight-change matrix ∆W can be decomposed into two low-rank matrices A and B. Instead of training the parameters in ∆W directly, LoRA trains the parameters in A and B, so the number of trainable parameters is much smaller. Suppose A has dimensions 100 × 1 and B has dimensions 1 × 100; then ∆W is a 100 × 100 matrix with 100 × 100 = 10,000 parameters, while A and B together hold only 100 + 100 = 200 trainable parameters.

The size of these low-rank matrices is defined by the parameter r. The smaller this value, the fewer parameters need to be trained and the faster the training; but too few parameters may lose information and hurt performance, so choosing r is something to think about as well. The sketch below makes the trade-off concrete.
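Here is a minimal numeric sketch (my own, reusing the 100-dimension toy example above) of the low-rank decomposition and of how r controls the number of trainable parameters:

    import numpy as np

    d = 100  # dimension of the (square) weight-change matrix delta_W

    for r in (1, 4, 16):
        A = np.random.randn(d, r)   # d x r low-rank factor
        B = np.random.randn(r, d)   # r x d low-rank factor
        delta_W = A @ B             # reconstructed d x d weight update
        full = d * d                # parameters if we trained delta_W directly
        low_rank = A.size + B.size  # parameters LoRA actually trains
        print(f"r={r:>2}: {low_rank:>5} trainable params instead of {full}")
    # r= 1:   200 trainable params instead of 10000
    # r= 4:   800 trainable params instead of 10000
    # r=16:  3200 trainable params instead of 10000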

Finally, QLoRA [5] applies quantization to the LoRA method, with tricks that optimize memory usage to make training "lighter" and cheaper.

The fine-tuning process

Our example uses QLoRA, so we need to specify a BitsAndBytes configuration, download the pre-trained model in 4-bit quantization, and define a LoraConfig.

    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig)

    # Get the type
    compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

    # BitsAndBytesConfig int-4 config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=use_4bit,
        bnb_4bit_use_double_quant=use_double_nested_quant,
        bnb_4bit_quant_type=bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=compute_dtype
    )

    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_id,
        quantization_config=bnb_config, use_cache=False, device_map=device_map)
    model.config.pretraining_tp = 1

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

The parameters are defined below:

    # Activate 4-bit precision base model loading
    use_4bit = True
    # Compute dtype for 4-bit base models
    bnb_4bit_compute_dtype = "float16"
    # Quantization type (fp4 or nf4)
    bnb_4bit_quant_type = "nf4"
    # Activate nested quantization for 4-bit base models (double quantization)
    use_double_nested_quant = False
    # LoRA attention dimension
    lora_r = 64
    # Alpha parameter for LoRA scaling
    lora_alpha = 16
    # Dropout probability for LoRA layers
    lora_dropout = 0.1
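The snippet that builds the peft_config used by the trainer below does not appear in the text; here is a minimal sketch based on the LoRA parameters defined above (peft's default target modules for Llama are assumed):

    from peft import LoraConfig

    # Build the LoRA configuration from the parameters defined above
    peft_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
    )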

The next steps should be familiar to any Hugging Face user: set the training arguments and create a Trainer. Since we are doing instruction fine-tuning, we call the SFTTrainer, which wraps the PEFT model definition and the other steps.

    from transformers import TrainingArguments
    from trl import SFTTrainer

    # Define the training arguments
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=per_device_train_batch_size, # 6 if use_flash_attention else 4
        gradient_accumulation_steps=gradient_accumulation_steps,
        gradient_checkpointing=gradient_checkpointing,
        optim=optim,
        logging_steps=logging_steps,
        save_strategy="epoch",
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        fp16=fp16,
        bf16=bf16,
        max_grad_norm=max_grad_norm,
        warmup_ratio=warmup_ratio,
        group_by_length=group_by_length,
        lr_scheduler_type=lr_scheduler_type,
        disable_tqdm=disable_tqdm,
        report_to="tensorboard",
        seed=42
    )

    # Create the trainer
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        peft_config=peft_config,
        max_seq_length=max_seq_length,
        tokenizer=tokenizer,
        packing=packing,
        formatting_func=format_instruction,
        args=args,
    )

    # Train the model
    trainer.train()  # there will not be a progress bar since tqdm is disabled

    # Save the model locally
    trainer.save_model()

Most of these parameters are the usual ones found in other LLM fine-tuning scripts, so we won't go into much detail:

    # Number of training epochs
    num_train_epochs = 1
    # Enable fp16/bf16 training (set bf16 to True with an A100)
    fp16 = False
    bf16 = True
    # Batch size per GPU for training
    per_device_train_batch_size = 4
    # Number of update steps to accumulate the gradients for
    gradient_accumulation_steps = 1
    # Enable gradient checkpointing
    gradient_checkpointing = True
    # Maximum gradient norm (gradient clipping)
    max_grad_norm = 0.3
    # Initial learning rate (AdamW optimizer)
    learning_rate = 2e-4
    # Weight decay to apply to all layers except bias/LayerNorm weights
    weight_decay = 0.001
    # Optimizer to use
    optim = "paged_adamw_32bit"
    # Learning rate schedule
    lr_scheduler_type = "cosine"  # "constant"
    # Ratio of steps for a linear warmup (from 0 to learning rate)
    warmup_ratio = 0.03
    # Group sequences into batches with same length
    # Saves memory and speeds up training considerably
    group_by_length = False
    # Save checkpoint every X update steps
    save_steps = 0
    # Log every X update steps
    logging_steps = 25
    # Disable tqdm
    disable_tqdm = True

Merging the weights

As mentioned above, LoRA trains "modified weights" on top of the base model, so the final model requires merging the pre-trained model and the adapter weights into a single model.

    import torch
    from peft import AutoPeftModelForCausalLM

    model = AutoPeftModelForCausalLM.from_pretrained(
        args.output_dir,
        low_cpu_mem_usage=True,
        return_dict=True,
        torch_dtype=torch.float16,
        device_map=device_map,
    )

    # Merge LoRA and base model
    merged_model = model.merge_and_unload()

    # Save the merged model
    merged_model.save_pretrained("merged_model", safe_serialization=True)
    tokenizer.save_pretrained("merged_model")

    # Push merged model to the hub
    merged_model.push_to_hub(hf_model_repo)
    tokenizer.push_to_hub(hf_model_repo)

Inference

Finally, the inference step:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Get the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(hf_model_repo)
    # Load the model
    model = AutoModelForCausalLM.from_pretrained(hf_model_repo, load_in_4bit=True,
                                                 torch_dtype=torch.float16,
                                                 device_map=device_map)

    # Create an instruction
    instruction = "Optimize a code snippet written in Python. The code snippet should create a list of numbers from 0 to 10 that are divisible by 2."
    input = ""

    prompt = f"""### Instruction:
    Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

    ### Task:
    {instruction}

    ### Input:
    {input}

    ### Response:
    """

    # Tokenize the input
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
    # Run the model to infer an output
    outputs = model.generate(input_ids=input_ids, max_new_tokens=100,
                             do_sample=True, top_p=0.9, temperature=0.5)

    # Print the result
    print(f"Prompt:\n{prompt}\n")
    print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")

The result looks like this:

    Prompt:
    ### Instruction:
    Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

    ### Task:
    Optimize a code snippet written in Python. The code snippet should create a list of numbers from 0 to 10 that are divisible by 2.

    ### Input:
    arr = []
    for i in range(10):
        if i % 2 == 0:
            arr.append(i)

    ### Response:

    Generated instruction:
    arr = [i for i in range(10) if i % 2 == 0]

    Ground truth:
    arr = [i for i in range(11) if i % 2 == 0]

It looks pretty good, though note the generated code uses range(10) and so misses the number 10, while the ground truth uses range(11).

Summary

That is the complete process of fine-tuning Llama 2. One of the most important steps here is actually creating the prompt: a good prompt makes a real difference to the model's performance.

[1] Llama 2 paper: https://arxiv.org/pdf/2307.09288.pdf

[2] Python code dataset: https://huggingface.co/datasets/sahil2801/code_instructions_120k

[3] The dataset used in this article: https://huggingface.co/datasets/iamtarun/python_code_instruct...

[4] LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685

[5] QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314

https://avoid.overfit.cn/post/9794c9eef1df4e55adf514b3d727ee3b

Author: Eduardo Muñoz