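Machine learning models keep getting larger. Even with an already trained model, inference time and memory cost soar when the hardware does not meet what the model expects to run on. One way to mitigate this is distillation, which shrinks a network to a reasonable size while minimizing the loss in performance.

In a previous article we described how DistilBERT [1] introduced a simple yet effective distillation technique that can easily be applied to any BERT-like model, but we gave no code. In this article we go into the details and give a complete implementation.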

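Initializing the student model

Because we want to initialize a new model from an existing one, we need access to the old model's weights. This article uses the RoBERTa [2] large model provided by Hugging Face as the teacher. To get at its weights, we first need to know how to access them.

Hugging Face model structure

The first thing to try is printing the model, which should give us some insight into how it is put together. We could also dig through the Hugging Face documentation [3], but that is more tedious.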

from transformers import AutoModelForMaskedLM

roberta = AutoModelForMaskedLM.from_pretrained("roberta-large")
print(roberta)

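Running this code prints the model's nested module structure.

Hugging Face models are PyTorch modules, so the sub-components of a module can be accessed with the .children() generator. To walk the whole model, we call .children() on it and then on each child in turn, which gives a recursive function: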

from typing import Any

from transformers import AutoModelForMaskedLM

roberta = AutoModelForMaskedLM.from_pretrained("roberta-large")


def visualize_children(
    obj: Any,
    level: int = 0,
) -> None:
    """
    Prints the children of (obj) and their children too, if there are any.
    Uses the current depth (level) to print things in an orderly manner.
    """
    print(f"{'   ' * level}{level}- {type(obj).__name__}")
    try:
        for child in obj.children():
            visualize_children(child, level + 1)
    except AttributeError:
        # Objects without a .children() method are leaves
        pass


visualize_children(roberta)

This prints an indented tree of the model's modules, one line per module with its depth and type.
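To try the traversal without downloading roberta-large, the same recursion can be run on a small stand-in module. This is only a sketch: toy_model and describe_children are hypothetical names, and the function returns the lines instead of printing them so that the result is easy to inspect.

```python
from typing import List

from torch import nn


def describe_children(module: nn.Module, level: int = 0) -> List[str]:
    """Returns one indented line per module, visiting the tree depth first."""
    lines = [f"{'   ' * level}{level}- {type(module).__name__}"]
    for child in module.children():
        lines.extend(describe_children(child, level + 1))
    return lines


# Hypothetical stand-in for a real model: a two-layer "encoder" followed by a "head"
toy_model = nn.Sequential(
    nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4)),
    nn.Linear(4, 2),
)
tree = describe_children(toy_model)
print("\n".join(tree))
```

Each nesting level adds one indentation step, which is how the layers of the encoder show up grouped under their parent.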

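The RoBERTa model appears to be structured like other BERT-like models: embeddings, then an encoder made of a stack of layers, then a head.

Copying the teacher's weights

To initialize a BERT-like model the DistilBERT [1] way, we copy everything as-is except the deepest-level Roberta layers, of which we keep only half. The steps are as follows. First, we create the student model, with the same architecture as the teacher but half as many hidden layers. For this we only need the teacher's configuration, a dictionary-like object that describes the architecture of a Hugging Face model. It is available as the roberta.config attribute.

The attribute we are interested in is num_hidden_layers. Let's write a function that copies this configuration, halves that attribute, and then creates a new model from the new configuration: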

from transformers import RobertaConfig
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel


def distill_roberta(
    teacher_model: RobertaPreTrainedModel,
) -> RobertaPreTrainedModel:
    """
    Distills a RoBERTa (teacher_model) the way DistilBERT distills a BERT model.
    The student model has the same configuration, except for the number of hidden layers, which is halved.
    The student layers are initialized by copying one out of two layers of the teacher, starting with layer 0.
    The head of the teacher is also copied.
    """
    # Get the teacher configuration as a dictionary
    configuration = teacher_model.config.to_dict()
    # Halve the number of hidden layers
    configuration['num_hidden_layers'] //= 2
    # Convert the dictionary to the student configuration
    configuration = RobertaConfig.from_dict(configuration)
    # Create the uninitialized student model
    student_model = type(teacher_model)(configuration)
    # Initialize the student's weights
    distill_roberta_weights(teacher=teacher_model, student=student_model)
    # Return the student model
    return student_model
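The configuration step can be checked on its own with a tiny, randomly initialized model, so that nothing has to be downloaded. The sizes below are arbitrary values chosen for this sketch, not those of roberta-large, and the weight copy is left out.

```python
from transformers import RobertaConfig, RobertaForSequenceClassification

# Hypothetical tiny configuration, chosen only so that the example runs quickly
teacher_config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=4,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
teacher = RobertaForSequenceClassification(teacher_config)

# Same configuration steps as in distill_roberta, without the weight copy
student_dict = teacher.config.to_dict()
student_dict["num_hidden_layers"] //= 2
student = type(teacher)(RobertaConfig.from_dict(student_dict))

n_teacher_layers = len(teacher.roberta.encoder.layer)
n_student_layers = len(student.roberta.encoder.layer)
print(n_teacher_layers, n_student_layers)
```

Everything except the depth of the encoder is inherited from the teacher, which is what later allows the embeddings and the head to be copied unchanged.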

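The distill_roberta_weights function puts half of the teacher's weights into the student's layers, so we still need to write it. Since recursion worked well for exploring the teacher model, we can use the same idea to explore it and copy parts of it. We iterate over the teacher and the student at the same time and copy from one to the other. The only part that needs care is the stack of hidden layers, of which only half is copied.

The function is as follows: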

from torch.nn import Module
from transformers.models.roberta.modeling_roberta import RobertaEncoder, RobertaModel


def distill_roberta_weights(
    teacher: Module,
    student: Module,
) -> None:
    """
    Recursively copies the weights of the (teacher) to the (student).
    This function is meant to be first called on a RobertaFor... model, but is then called on every child of that model recursively.
    The only part that's not fully copied is the encoder, of which only half is copied.
    """
    # If the part is an entire RoBERTa model or a RobertaFor..., unpack and iterate
    if isinstance(teacher, RobertaModel) or type(teacher).__name__.startswith('RobertaFor'):
        for teacher_part, student_part in zip(teacher.children(), student.children()):
            distill_roberta_weights(teacher_part, student_part)
    # Else if the part is an encoder, copy one out of every two layers
    elif isinstance(teacher, RobertaEncoder):
        # The first child of the encoder is its list of layers
        teacher_encoding_layers = list(next(teacher.children()))
        student_encoding_layers = list(next(student.children()))
        for i in range(len(student_encoding_layers)):
            student_encoding_layers[i].load_state_dict(teacher_encoding_layers[2 * i].state_dict())
    # Else the part is a head or something else, copy the state_dict
    else:
        student.load_state_dict(teacher.state_dict())
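The encoder branch can be illustrated in isolation. In this sketch, two lists of small linear layers stand in for the encoder layers of the teacher and the student (an assumption made to keep the example light), and the check confirms that student layer i ends up with the values of teacher layer 2 * i.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the encoder layer lists of the teacher and the student
teacher_layers = nn.ModuleList([nn.Linear(3, 3) for _ in range(4)])
student_layers = nn.ModuleList([nn.Linear(3, 3) for _ in range(2)])

# Student layer i receives the weights of teacher layer 2 * i
for i, student_layer in enumerate(student_layers):
    student_layer.load_state_dict(teacher_layers[2 * i].state_dict())

copied = [
    torch.equal(student_layers[i].weight, teacher_layers[2 * i].weight)
    for i in range(len(student_layers))
]
# load_state_dict copies values, so the student does not share memory with the teacher
shares_memory = student_layers[0].weight.data_ptr() == teacher_layers[0].weight.data_ptr()
print(copied, shares_memory)
```

Because the values are copied rather than shared, training the student afterwards leaves the teacher untouched.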

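Through recursion and type checking, this function ensures that the student is identical to the teacher, except for the Roberta layers. To change which layers are copied at initialization, only the for loop in the encoder branch needs to change.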
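One way to make that choice explicit is to compute the list of teacher layer indices separately from the copy itself. The helper below is a sketch with hypothetical names; only the "alternate" strategy corresponds to what distill_roberta_weights does, the other two are shown as examples of alternatives.

```python
from typing import List


def teacher_layer_indices(n_teacher: int, n_student: int, strategy: str = "alternate") -> List[int]:
    """
    Returns, for each student layer, the index of the teacher layer to copy.
    The strategy names are hypothetical labels used only in this sketch.
    """
    if strategy == "alternate":
        # Layers 0, 2, 4, ... as in distill_roberta_weights
        return [2 * i for i in range(n_student)]
    if strategy == "first":
        # The n_student layers closest to the embeddings
        return list(range(n_student))
    if strategy == "last":
        # The n_student layers closest to the head
        return list(range(n_teacher - n_student, n_teacher))
    raise ValueError(f"Unknown strategy: {strategy}")


# roberta-large has 24 hidden layers, so its student has 12
indices = teacher_layer_indices(24, 12)
print(indices)
```

In the encoder branch, the loop would then read teacher_encoding_layers[indices[i]] in place of teacher_encoding_layers[2 * i].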

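Now that we have a student model, we need to train it. This part is relatively straightforward; the main question is which loss function to use.

Custom loss function

As a reminder of the DistilBERT training procedure, consider the following figure:

Turn your attention to the big red box labeled "loss". Before detailing what is inside, we need to know how to collect what we feed into it. The figure shows that we need three things: the labels, and the embeddings of the student and of the teacher. We already have the labels, since this is supervised learning. Let's see how to get the other two.

Inputs of the teacher and the student

Here we need a function that, given the inputs of a BERT-like model (the two tensors input_ids and attention_mask) and the model itself, returns that model's logits. Since we are using Hugging Face, this is simple; all we need is to understand the code below: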

from torch import Tensor
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel


def get_logits(
    model: RobertaPreTrainedModel,
    input_ids: Tensor,
    attention_mask: Tensor,
) -> Tensor:
    """
    Given a RoBERTa (model) for classification and the couple of (input_ids) and (attention_mask),
    returns the logits corresponding to the prediction.
    """
    return model.classifier(
        model.roberta(input_ids=input_ids, attention_mask=attention_mask)[0]
    )

Both the student and the teacher go through this function, the student with gradients and the teacher without.
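A minimal sketch of that difference, using two small linear layers as hypothetical stand-ins for the teacher and the student: the teacher's forward pass is wrapped in torch.no_grad(), the student's is not.

```python
import torch
from torch import nn

# Hypothetical stand-ins: any two modules mapping inputs to logits would do
teacher = nn.Linear(8, 3)
student = nn.Linear(8, 3)
inputs = torch.randn(4, 8)

# The teacher is only evaluated, so no computation graph is built for it
teacher.eval()
with torch.no_grad():
    teacher_logits = teacher(inputs)

# The student is being trained, so its logits must keep their graph
student_logits = student(inputs)

print(teacher_logits.requires_grad, student_logits.requires_grad)
```

With the real models, the two forward passes would be calls to get_logits with the same input_ids and attention_mask.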

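Implementing the loss function

For a detailed description of the loss function, see our previous article; here we use the following figure to explain it:

What we call the "converging cosine-loss" is the regular cosine loss used to align two input vectors. Here is the code: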

import torch
from torch import Tensor
from torch.nn import CosineEmbeddingLoss, CrossEntropyLoss


def distillation_loss(
    teacher_logits: Tensor,
    student_logits: Tensor,
    labels: Tensor,
    temperature: float = 1.0,
) -> Tensor:
    """
    The distillation loss for distilling a BERT-like model.
    The loss takes the (teacher_logits), (student_logits) and (labels) for various losses.
    The (temperature) can be given, otherwise it's set to 1 by default.
    """
    # Temperature scaling
    student_logits = student_logits / temperature
    teacher_logits = teacher_logits / temperature
    # Softmax probabilities, used as soft targets and as inputs of the cosine loss
    student_probs = student_logits.softmax(1)
    teacher_probs = teacher_logits.softmax(1)
    # Classification loss (problem-specific loss); CrossEntropyLoss expects unnormalized logits
    loss = CrossEntropyLoss()(student_logits, labels)
    # CrossEntropy teacher-student loss (probability targets require PyTorch 1.10 or later)
    loss = loss + CrossEntropyLoss()(student_logits, teacher_probs)
    # Cosine loss, with a target of 1 so that the two vectors are pulled together
    target = torch.ones(teacher_probs.size(0), device=teacher_probs.device)
    loss = loss + CosineEmbeddingLoss()(teacher_probs, student_probs, target)
    # Average the loss and return it
    loss = loss / 3
    return loss
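Two ingredients of this loss are easy to check on small tensors: the temperature, which flattens the softmax distribution, and the cosine loss, which for a target of 1 equals one minus the cosine similarity. The values below are arbitrary and only serve the illustration.

```python
import torch
from torch.nn import CosineEmbeddingLoss

logits = torch.tensor([[4.0, 1.0, 0.0]])

# Dividing by a temperature above 1 flattens the distribution
sharp = (logits / 1.0).softmax(1)
soft = (logits / 4.0).softmax(1)

# With a target of 1, CosineEmbeddingLoss is 1 - cos(x1, x2)
target = torch.ones(1)
same_direction = CosineEmbeddingLoss()(sharp, 2 * sharp, target)
orthogonal = CosineEmbeddingLoss()(
    torch.tensor([[1.0, 0.0]]), torch.tensor([[0.0, 1.0]]), target
)
print(sharp.max().item(), soft.max().item(), same_direction.item(), orthogonal.item())
```

The cosine term only looks at direction, which is why scaling one of the vectors by 2 leaves the loss at zero.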

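That covers the implementation of all the key ideas of DistilBERT. Some things are still missing, such as GPU support and the full training routine, so the complete code is linked at the end of the article. For real use, the Distillator class provided there is recommended.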
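To give an idea of what the missing pieces look like, here is a rough sketch of a few training steps with device handling. It is not the Distillator class: the models are small linear layers standing in for the teacher and the student, the batch is random, and the loss is a simplified two-term stand-in for distillation_loss (it assumes PyTorch 1.10 or later for probability targets).

```python
import torch
from torch import nn
from torch.nn import CrossEntropyLoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

# Hypothetical stand-ins for the teacher, the student and a batch of data
teacher = nn.Linear(8, 3).to(device).eval()
student = nn.Linear(8, 3).to(device)
inputs = torch.randn(16, 8, device=device)
labels = torch.randint(0, 3, (16,), device=device)

# Only the student's parameters are handed to the optimizer
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-2)
student_before = student.weight.detach().clone()
teacher_before = teacher.weight.detach().clone()

for _ in range(5):
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    # Simplified stand-in for distillation_loss: hard-label term plus soft-target term
    loss = CrossEntropyLoss()(student_logits, labels)
    loss = loss + CrossEntropyLoss()(student_logits, teacher_logits.softmax(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

student_changed = not torch.equal(student_before, student.weight.detach())
teacher_unchanged = torch.equal(teacher_before, teacher.weight.detach())
print(student_changed, teacher_unchanged)
```

A real routine would iterate over a data loader, move each batch to the device, and evaluate on held-out data between epochs.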

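Results

How do models distilled this way end up performing? For DistilBERT, see the original paper [1]. For RoBERTa, a DistilBERT-like distilled version already exists on Hugging Face. On the GLUE benchmark [4], we can compare the two models:

As for time and memory cost, this model is about two-thirds the size of roberta-base and twice as fast.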
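Halving the number of layers does not halve the model, because the embeddings and the head are copied in full. The sketch below makes this visible on a hypothetical miniature model (an embedding table followed by linear layers), where the parameter count is obtained by summing numel() over the parameters.

```python
from torch import nn


def count_parameters(module: nn.Module) -> int:
    """Returns the total number of parameters of (module)."""
    return sum(parameter.numel() for parameter in module.parameters())


def toy_model(n_layers: int) -> nn.Module:
    # Hypothetical miniature: an embedding table followed by n_layers linear layers
    return nn.Sequential(nn.Embedding(1000, 16), *[nn.Linear(16, 16) for _ in range(n_layers)])


teacher_size = count_parameters(toy_model(12))
student_size = count_parameters(toy_model(6))
print(teacher_size, student_size, student_size / teacher_size)
```

The same count_parameters function can be applied to a real teacher and its student to measure the actual reduction.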

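Conclusion

With the code above we can distill any BERT-like model. There are also many other, better methods, such as TinyBERT [5] or MobileBERT [6]. If you think one of them fits your needs better, you should read those papers. You could even try an entirely new distillation method, since this is a growing field.

The code for this article is available here: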

https://www.overfit.cn/post/6583351575974a5993a4ebd98b51088e

References

[1] Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019), Hugging Face

[2] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019), arXiv

[3] Hugging Face team crediting Julien Chaumond, Hugging Face’s RoBERTa documentation, Hugging Face

[4] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding (2019), arXiv

[5] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, TinyBERT: Distilling BERT for Natural Language Understanding (2019), arXiv

[6] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices (2020), arXiv