Fine-tuning Llama 2: a tutorial for building your own Python code generator

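This article demonstrates how to use PEFT, QLoRA, and Hugging Face to fine-tune the new Llama 2 and build your own code generator. The focus is on how to customize your own Llama 2 and train it quickly for a specific task.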

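Compared with the previous generation, Llama 2 was trained on 40% more tokens, reaching 2T; its context length was doubled; and it uses grouped-query attention (GQA) to speed up inference on the heavier 70B model. On top of the standard transformer architecture it uses RMSNorm normalization, SwiGLU activation, and rotary positional embeddings. The context length reaches 4096 tokens, and training used an AdamW optimizer with a cosine learning-rate schedule, a weight decay of 0.1, and gradient clipping.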

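The supervised fine-tuning (SFT) stage prioritizes high-quality samples over quantity, since many reports show that using high-quality data improves the performance of the final model.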

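Finally, a reinforcement learning from human feedback (RLHF) step aligns the model with user preferences. A large number of examples were collected in which humans chose their preferred model output in a comparison. This data was used to train a reward model.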

The key point is that Llama 2-Chat is reported to be about as good as OpenAI's ChatGPT, so we can use it as a local alternative.

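For the fine-tuning process we use a dataset of about 18,000 examples in which the model is asked to build Python code that solves a given task. It is an extract of the original dataset [2] in which only the Python examples were selected. Each row contains a description of the task to be solved, an example of input data for the task where applicable, and the generated code snippet that solves the task [3].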

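 from random import randrange
 from datasets import load_dataset
 
 # Load dataset from the hub (dataset_name and dataset_split are assumed to be set beforehand; see [3])
 dataset = load_dataset(dataset_name, split=dataset_split)
 # Show dataset size
 print(f"dataset size: {len(dataset)}")
 # Show a random example
 print(dataset[randrange(len(dataset))])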

To perform instruction fine-tuning, we must convert each data example into an instruction whose main parts are laid out as follows:

 def format_instruction(sample):
     # Build the full training text (prompt plus expected response) for one sample
     return f"""### Instruction:
 Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:
 
 ### Task:
 {sample['instruction']}
 
 ### Input:
 {sample['input']}
 
 ### Response:
 {sample['output']}
 """

The output looks like this:

 ### Instruction:
 Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:
 
 ### Task:
 Develop a Python program that prints "Hello, World!" whenever it is run.
 
 ### Input:
 
 
 ### Response:
 #Python program to print "Hello World!"
 
 print("Hello, World!")
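
The same template has to be reused at inference time, with the response part left empty for the model to complete. The sketch below illustrates that idea in plain Python. `TEMPLATE` and `build_prompt` are hypothetical names introduced here for illustration, not part of the original code.

```python
# Hypothetical helper (not from the original article): one template, two uses.
TEMPLATE = (
    "### Instruction:\n"
    "Use the Task below and the Input given to write the Response, "
    "which is a programming code that can solve the following Task:\n\n"
    "### Task:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(sample, include_response=True):
    """Return the training text, or the inference prompt if include_response is False."""
    prompt = TEMPLATE.format(
        instruction=sample["instruction"], input=sample.get("input", "")
    )
    if include_response:
        prompt += sample["output"] + "\n"
    return prompt


sample = {
    "instruction": 'Develop a Python program that prints "Hello, World!" whenever it is run.',
    "input": "",
    "output": 'print("Hello, World!")',
}
train_text = build_prompt(sample)                          # used for fine-tuning
infer_text = build_prompt(sample, include_response=False)  # used for generation
print(train_text)
```

Keeping both variants in one function makes it harder for the training and inference prompts to drift apart.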

For ease of demonstration we use a Google Colab environment. A T4 instance is enough for a first test run, but training on the whole dataset requires an A100.

In addition, you can log in to the Hugging Face Hub so that you can upload and share the model. This step is optional.

 from huggingface_hub import login
 from dotenv import load_dotenv
 import os
 
 # Load the environment variables
 load_dotenv()
 # Login to the Hugging Face Hub
 login(token=os.getenv("HF_HUB_TOKEN"))

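The usual steps for training an LLM are as follows: first, pretrain on billions or trillions of tokens to obtain a base model; then fine-tune that model to specialize it for a downstream task.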

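Parameter-efficient fine-tuning (PEFT) lets us greatly reduce RAM and storage requirements by fine-tuning only a small number of extra parameters, while all the original model parameters stay frozen. PEFT also improves the reusability and portability of the model: small checkpoints are easily added to the base model, so the same base model can be reused in multiple scenarios by adding PEFT parameters. Finally, because the base model is not modified, all the knowledge acquired during pretraining is retained, which avoids catastrophic forgetting.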

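PEFT keeps the pretrained base model unchanged and adds new layers or parameters on top of it. These layers are called "adapters": we add them to the pretrained base model and train only the parameters of these new layers. A serious problem with this approach, however, is that the extra layers increase latency at inference time, which makes the process inefficient in many cases.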

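LoRA (Low-Rank Adaptation of Large Language Models) does not add new layers. Instead, it adds values to the parameters of the model's existing layers in a way that avoids this dreaded latency problem at inference time. LoRA trains and stores the changes to the weights while freezing all the weights of the pretrained model. In other words, we train a new weight matrix that holds the changes to the pretrained matrix, and decompose this new matrix into two low-rank matrices.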

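The authors of LoRA [4] propose that the weight-change matrix ∆W can be decomposed into two low-rank matrices, A and B. LoRA does not train the parameters of ∆W directly; it trains the parameters of A and B instead, so the number of trainable parameters is much smaller. Suppose A has dimensions 100 × 1 and B has dimensions 1 × 100; then ∆W has 100 * 100 = 10000 parameters. Only 100 + 100 = 200 parameters are trained in A and B, compared with 10000 in ∆W.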
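
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: `lora_param_counts` and `matmul` are hypothetical helpers written for this illustration, and the values placed in A and B are arbitrary.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, low_rank) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in              # parameters in the full update ∆W
    low_rank = d_out * r + r * d_in  # parameters in A (d_out x r) plus B (r x d_in)
    return full, low_rank


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


full, low_rank = lora_param_counts(100, 100, 1)
print(full, low_rank)  # 10000 200

# A is 100 x 1 and B is 1 x 100, so A @ B has the same 100 x 100 shape as ∆W.
A = [[0.01 * i] for i in range(100)]  # arbitrary illustrative values
B = [[0.5] * 100]
delta_w = matmul(A, B)
print(len(delta_w), len(delta_w[0]))  # 100 100
```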

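The size of these low-rank matrices is set by the r parameter. The smaller this value, the fewer parameters need to be trained and the faster training is. However, too few parameters may cost information and performance, so the choice of r deserves some thought.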
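
To get a feel for this trade-off, the following sketch prints the share of a single weight matrix that LoRA would train for a few ranks. The 4096 × 4096 shape and the list of ranks are illustrative assumptions, not values taken from the article.

```python
def trainable_fraction(d_out, d_in, r):
    """Fraction of a d_out x d_in matrix's parameters that LoRA trains at rank r."""
    return (d_out * r + r * d_in) / (d_out * d_in)


# 4096 x 4096 is an assumed, illustrative matrix size.
for r in (8, 16, 64):
    print(f"r={r:>2}: {trainable_fraction(4096, 4096, r):.4%} of the full matrix")
```

Even at r = 64 only a few percent of the matrix is trained, which is why the rank can usually be raised without losing most of the savings.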

Finally, QLoRA [5] applies quantization to the LoRA method, using memory-optimization techniques to make training "lighter" and cheaper.
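
As a rough intuition for what 4-bit quantization does to a block of weights, here is a deliberately simplified sketch. It uses uniform absmax quantization, which is an assumption made for clarity: QLoRA itself uses the NF4 data type with non-uniform levels, so this is not the bitsandbytes implementation. The function names and weight values are made up.

```python
def quantize_absmax_4bit(values):
    """Map floats to signed integer codes in [-7, 7] using one absmax scale per block."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.31, 0.0, -0.07, 0.44]  # made-up example weights
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"max abs error: {max_error:.4f}")
```

Each weight is stored as a small integer plus one shared scale per block, which is where the memory saving comes from; the price is the rounding error printed at the end.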

Our example uses QLoRA, so we need to specify the BitsAndBytes configuration, download the pretrained model quantized to 4 bits, and define a LoraConfig.

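 import torch
 from peft import LoraConfig
 from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
 
 # model_id and device_map are assumed to be defined earlier in the notebook
 # Resolve the compute dtype (e.g. torch.float16) from its string name
 compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
 
 # BitsAndBytesConfig int-4 config
 bnb_config = BitsAndBytesConfig(
     load_in_4bit=use_4bit,
     bnb_4bit_use_double_quant=use_double_nested_quant,
     bnb_4bit_quant_type=bnb_4bit_quant_type,
     bnb_4bit_compute_dtype=compute_dtype
 )
 # Load the 4-bit quantized model
 model = AutoModelForCausalLM.from_pretrained(model_id,
   quantization_config=bnb_config, use_cache=False, device_map=device_map)
 model.config.pretraining_tp = 1
 # Load the tokenizer
 tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
 tokenizer.pad_token = tokenizer.eos_token
 tokenizer.padding_side = "right"
 
 # LoRA config; bias and task_type are assumed values, since the
 # original snippet omits this definition
 peft_config = LoraConfig(
     r=lora_r,
     lora_alpha=lora_alpha,
     lora_dropout=lora_dropout,
     bias="none",
     task_type="CAUSAL_LM",
 )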

The parameters used above are defined as follows (they must be set before running the previous block):

 # Activate 4-bit precision base model loading
 use_4bit = True
 # Compute dtype for 4-bit base models
 bnb_4bit_compute_dtype = "float16"
 # Quantization type (fp4 or nf4)
 bnb_4bit_quant_type = "nf4"
 # Activate nested quantization for 4-bit base models (double quantization)
 use_double_nested_quant = False
 # LoRA attention dimension
 lora_r = 64
 # Alpha parameter for LoRA scaling
 lora_alpha = 16
 # Dropout probability for LoRA layers
 lora_dropout = 0.1

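The next steps should be familiar to all Hugging Face users: set the training arguments and create a Trainer. For instruction fine-tuning we use SFTTrainer, which wraps the PEFT model definition and other steps.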

 from transformers import TrainingArguments
 from trl import SFTTrainer
 
 # Define the training arguments
 args = TrainingArguments(
     output_dir=output_dir,
     num_train_epochs=num_train_epochs,
     per_device_train_batch_size=per_device_train_batch_size, # 6 if use_flash_attention else 4,
     gradient_accumulation_steps=gradient_accumulation_steps,
     gradient_checkpointing=gradient_checkpointing,
     optim=optim,
     logging_steps=logging_steps,
     save_strategy="epoch",
     learning_rate=learning_rate,
     weight_decay=weight_decay,
     fp16=fp16,
     bf16=bf16,
     max_grad_norm=max_grad_norm,
     warmup_ratio=warmup_ratio,
     group_by_length=group_by_length,
     lr_scheduler_type=lr_scheduler_type,
     disable_tqdm=disable_tqdm,
     report_to="tensorboard",
     seed=42
 )
 # Create the trainer
 trainer = SFTTrainer(
     model=model,
     train_dataset=dataset,
     peft_config=peft_config,
     max_seq_length=max_seq_length,
     tokenizer=tokenizer,
     packing=packing,
     formatting_func=format_instruction,
     args=args,
 )
 # train the model
 trainer.train() # there will not be a progress bar since tqdm is disabled
 
 # save model in local
 trainer.save_model()

Most of these parameters are commonly used in other LLM fine-tuning scripts, so we will not explain them in detail:

 # Number of training epochs
 num_train_epochs = 1
 # Enable fp16/bf16 training (set bf16 to True with an A100)
 fp16 = False
 bf16 = True
 # Batch size per GPU for training
 per_device_train_batch_size = 4
 # Number of update steps to accumulate the gradients for
 gradient_accumulation_steps = 1
 # Enable gradient checkpointing
 gradient_checkpointing = True
 # Maximum gradient norm (gradient clipping)
 max_grad_norm = 0.3
 # Initial learning rate (AdamW optimizer)
 learning_rate = 2e-4
 # Weight decay to apply to all layers except bias/LayerNorm weights
 weight_decay = 0.001
 # Optimizer to use
 optim = "paged_adamw_32bit"
 # Learning rate schedule
 lr_scheduler_type = "cosine" #"constant"
 # Ratio of steps for a linear warmup (from 0 to learning rate)
 warmup_ratio = 0.03
 # Group sequences into batches with same length
 # Saves memory and speeds up training considerably
 group_by_length = False
 # Save checkpoint every X updates steps
 save_steps = 0
 # Log every X updates steps
 logging_steps = 25
 # Disable tqdm
 disable_tqdm = True
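
To make the learning-rate settings more concrete, here is a minimal sketch of a cosine schedule with linear warmup, using `learning_rate = 2e-4` and `warmup_ratio = 0.03` from the block above. The function `cosine_lr` is a hypothetical stand-in written for illustration, not the scheduler the Trainer uses internally. The step count assumes roughly 18,000 examples, no packing, and a single GPU.

```python
import math


def cosine_lr(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Rough step count under the assumptions stated above.
examples, batch_size, grad_accum, epochs = 18_000, 4, 1, 1
total_steps = math.ceil(examples / (batch_size * grad_accum)) * epochs
print(total_steps)  # 4500

for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The rate starts at zero, climbs to the configured value during the first few percent of training, and then decays smoothly toward zero by the last step.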

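As mentioned above, LoRA trains "modification weights" on top of the base model, so the final model requires merging the pretrained model and the adapter weights into a single model.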

 import torch
 from peft import AutoPeftModelForCausalLM
 
 model = AutoPeftModelForCausalLM.from_pretrained(
     args.output_dir,
     low_cpu_mem_usage=True,
     return_dict=True,
     torch_dtype=torch.float16,
     device_map=device_map,    
 )
 
 # Merge LoRA and base model
 merged_model = model.merge_and_unload()
 
 # Save the merged model
 merged_model.save_pretrained("merged_model",safe_serialization=True)
 tokenizer.save_pretrained("merged_model")
 # push merged model to the hub
 merged_model.push_to_hub(hf_model_repo)
 tokenizer.push_to_hub(hf_model_repo)
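
Conceptually, merging folds the scaled low-rank update into the frozen weights, so inference no longer needs a separate adapter path. The toy sketch below checks that equivalence on a 2 × 2 example, following this article's naming (∆W = A · B). All values and helper names are hypothetical, and this is not how `merge_and_unload` is implemented internally.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def merge_lora(W, A, B, scaling):
    """Return W + scaling * (A @ B), the merged weight matrix."""
    delta = matmul(A, B)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weights
A = [[1.0], [2.0]]            # trained adapter factor, 2 x 1
B = [[0.5, 0.5]]              # trained adapter factor, 1 x 2
scaling = 0.25
x = [2.0, 4.0]

merged = merge_lora(W, A, B, scaling)
# Output with a separate adapter path versus output of the merged matrix.
adapter_out = [b + scaling * u for b, u in zip(matvec(W, x), matvec(A, matvec(B, x)))]
merged_out = matvec(merged, x)
print(merged_out, adapter_out)
```

Both paths give the same output, which is why the merged model can be saved and served like an ordinary checkpoint.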

The last step is inference.

 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
 # Get the tokenizer
 tokenizer = AutoTokenizer.from_pretrained(hf_model_repo)
 # Load the model
 model = AutoModelForCausalLM.from_pretrained(hf_model_repo, load_in_4bit=True, 
                                              torch_dtype=torch.float16,
                                              device_map=device_map)
 # Create an instruction
 instruction="Optimize a code snippet written in Python. The code snippet should create a list of numbers from 0 to 10 that are divisible by 2."
 input = ""
 prompt = f"""### Instruction:
 Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.
 
 ### Task:
 {instruction}
 
 ### Input:
 {input}
 
 ### Response:
 """
 # Tokenize the input
 input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
 # Run the model to infer an output
 outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.5)
 
 # Print the result
 print(f"Prompt:\n{prompt}\n")
 print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
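
The `temperature` and `top_p` arguments passed to `generate` control how the next token is sampled. The sketch below shows the general idea in plain Python: temperature scaling followed by nucleus (top-p) filtering. It is a simplified illustration, not the transformers implementation, and the logits and the function name `top_p_distribution` are made up.

```python
import math


def top_p_distribution(logits, temperature=0.5, top_p=0.9):
    """Apply temperature, keep the smallest token set whose mass reaches top_p, renormalize."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk through tokens from most to least likely until top_p mass is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
dist = top_p_distribution(logits)
print(dist)
```

A temperature below 1 sharpens the distribution and top-p discards the unlikely tail, which tends to suit code generation, where a single wrong token can break the program.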

The result is as follows:

 Prompt:
 ### Instruction:
 Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.
 
 ### Task:
 Optimize a code snippet written in Python. The code snippet should create a list of numbers from 0 to 10 that are divisible by 2.
 
 ### Input:
 arr = []
 for i in range(10):
     if i % 2 == 0:
         arr.append(i)
 
 ### Response:
 
 
 Generated instruction:
 arr = [i for i in range(10) if i % 2 == 0]
 
 Ground truth:
 arr = [i for i in range(11) if i % 2 == 0]

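The result looks quite good, although the generated code uses range(10) and therefore leaves out 10, which the ground truth includes via range(11).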
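
Running both snippets side by side is a quick way to see how the generated answer compares with the ground truth:

```python
generated = [i for i in range(10) if i % 2 == 0]
ground_truth = [i for i in range(11) if i % 2 == 0]

print(generated)     # [0, 2, 4, 6, 8]
print(ground_truth)  # [0, 2, 4, 6, 8, 10]
# Values present in the ground truth but missing from the generated answer
print(sorted(set(ground_truth) - set(generated)))  # [10]
```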

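That is the complete process of fine-tuning Llama 2. One of the most important steps is actually building the prompt: a good prompt helps model performance considerably.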

[1] Llama 2 paper. https://arxiv.org/pdf/2307.09288.pdf

[2] Python code dataset. http://sahil2801/code_instructions_120k

[3] Dataset used in this article. https://huggingface.co/datasets/iamtarun/python_code_instruct…

[4] LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685

[5] QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314

https://avoid.overfit.cn/post/9794c9eef1df4e55adf514b3d727ee3b

Author: Eduardo Muñoz
