PyTorch: Attention Layers in Vision Transformers, Concepts and Code Implementation

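Since "Attention Is All You Need" was published in 2017, transformers have become the state of the art in natural language processing (NLP). In 2021, "An Image is Worth 16×16 Words" successfully applied transformers to computer vision tasks. Since then, many transformer-based computer vision architectures have been proposed.

This article takes a close look at how the attention layer works in a computer vision context. We cover both single-head and multi-head attention. The article includes the code for the attention layer along with a conceptual explanation of the underlying math.

In NLP applications, attention is usually described as the relationship between the words (tokens) in a sentence. In computer vision applications, attention looks at the relationship between the patches (tokens) in an image.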

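There are several ways to break an image into a sequence of tokens. The original ViT² splits the image into patches and then flattens each patch into a token. "Tokens-to-Token ViT"³ developed a more complex method of creating tokens from an image.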
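The patch-splitting step can be sketched with `nn.Unfold`. This is only an illustration of the idea, assuming non-overlapping patches, a 224×224 RGB input, and a patch size of 16; the helper name `image_to_tokens` is hypothetical, and the ViT paper follows this step with a learned linear projection that is not shown here.

```python
import torch
import torch.nn as nn

def image_to_tokens(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split (B, C, H, W) images into flattened, non-overlapping patches."""
    # kernel_size == stride means the patches do not overlap
    unfold = nn.Unfold(kernel_size=patch_size, stride=patch_size)
    patches = unfold(images)        # (B, C * patch_size**2, num_patches)
    return patches.transpose(1, 2)  # (B, num_patches, C * patch_size**2)

images = torch.rand(2, 3, 224, 224)  # assumed input size, for illustration only
tokens = image_to_tokens(images, patch_size=16)
print(tokens.shape)  # torch.Size([2, 196, 768])
```

Each 224×224 image yields 14 × 14 = 196 tokens, and each token holds 3 × 16 × 16 = 768 values.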

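The dot-product (multiplicative) attention defined in "Attention Is All You Need" is currently the most common and also the simplest attention mechanism. Its code implementation is short: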

 import torch
 import torch.nn as nn
 from typing import Optional
 
 # Type alias used in the signature below: either None or a float
 NoneFloat = Optional[float]
 
 class Attention(nn.Module):
     def __init__(self,
                  dim: int,
                  chan: int,
                  num_heads: int = 1,
                  qkv_bias: bool = False,
                  qk_scale: NoneFloat = None):
 
         """ Attention Module
 
             Args:
                 dim (int): input size of a single token
                 chan (int): resulting size of a single token (channels)
                 num_heads (int): number of attention heads in MSA
                 qkv_bias (bool): determines if the qkv layer learns an additive bias
                 qk_scale (NoneFloat): value to scale the queries and keys by;
                                     if None, queries and keys are scaled by ``head_dim ** -0.5``
         """
 
         super().__init__()
 
         ## Define Constants
         self.num_heads = num_heads
         self.chan = chan
         self.head_dim = self.chan // self.num_heads
         self.scale = qk_scale or self.head_dim ** -0.5
         assert self.chan % self.num_heads == 0, '"Chan" must be evenly divisible by "num_heads".'
 
         ## Define Layers
         self.qkv = nn.Linear(dim, chan * 3, bias=qkv_bias)
         #### Each token gets projected from starting length (dim) to channel length (chan) 3 times (for each Q, K, V)
         self.proj = nn.Linear(chan, chan)
 
     def forward(self, x):
         B, N, C = x.shape
         ## Dimensions: (batch, num_tokens, token_len)
 
         ## Calculate QKVs
         qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
         #### Dimensions: (3, batch, heads, num_tokens, chan/num_heads = head_dim)
         q, k, v = qkv[0], qkv[1], qkv[2]
 
         ## Calculate Attention
         attn = (q * self.scale) @ k.transpose(-2, -1)
         attn = attn.softmax(dim=-1)
         #### Dimensions: (batch, heads, num_tokens, num_tokens)
 
         ## Attention Layer
         x = (attn @ v).transpose(1, 2).reshape(B, N, self.chan)
         #### Dimensions: (batch, num_tokens, chan)
 
         ## Projection Layers
         x = self.proj(x)
 
         ## Skip Connection Layer
         v = v.transpose(1, 2).reshape(B, N, self.chan)
         x = v + x
         #### Because the original x has a different size than the current x, use v for the skip connection
 
         return x

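For a single attention head, let's walk step by step through the forward pass. We use 7 * 7 = 49 as the starting token length, since that is the starting token size in the T2T-ViT models. The number of channels is 64, which is also the T2T-ViT default. We then assume there are 100 tokens and use a batch size of 13 for the forward pass (these two values were chosen so that they cannot be confused with any other parameter).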

 # Define an Input
 token_len = 7 * 7
 channels = 64
 num_tokens = 100
 batch = 13
 x = torch.rand(batch, num_tokens, token_len)
 B, N, C = x.shape
 print('Input dimensions are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\ttoken size:', x.shape[2])
 
 # Define the Module
 A = Attention(dim=token_len, chan=channels, num_heads=1, qkv_bias=False, qk_scale=None)
 A.eval()

The input dimensions are:

 Input dimensions are
    batchsize: 13 
    number of tokens: 100 
    token size: 49

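Attention is defined in terms of query, key, and value matrices. The first step is to compute these through a learnable linear layer. The qkv_bias term indicates whether this linear layer has a bias term. This step also changes the token length from the input 49 to the chan parameter (64).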
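The single `qkv` layer can be read as three projections stacked into one weight matrix. The sketch below checks this under the same sizes as the walkthrough (49 inputs, 64 channels, one head); the names `w_q`, `w_k`, and `w_v` are illustrative and do not appear in the module itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, chan = 49, 64
qkv = nn.Linear(dim, chan * 3, bias=False)

# Slice the fused weight (3 * chan, dim) into three (chan, dim) blocks.
w_q, w_k, w_v = qkv.weight.chunk(3, dim=0)

x = torch.rand(13, 100, dim)
fused = qkv(x).reshape(13, 100, 3, chan)         # (batch, tokens, 3, chan)
q_fused, k_fused, v_fused = fused.unbind(dim=2)  # each (batch, tokens, chan)

# The same projections, computed one at a time from the weight slices.
q_sep, k_sep, v_sep = x @ w_q.T, x @ w_k.T, x @ w_v.T

same = all(
    torch.allclose(a, b, atol=1e-5)
    for a, b in [(q_fused, q_sep), (k_fused, k_sep), (v_fused, v_sep)]
)
print(q_fused.shape, same)
```

Fusing the three projections into one `nn.Linear` produces the same values while needing only a single matrix multiplication.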

 qkv = A.qkv(x).reshape(B, N, 3, A.num_heads, A.head_dim).permute(2, 0, 3, 1, 4)
 q, k, v = qkv[0], qkv[1], qkv[2]
 print('Dimensions for Queries are\n\tbatchsize:', q.shape[0], '\n\tattention heads:', q.shape[1], '\n\tnumber of tokens:', q.shape[2], '\n\tnew length of tokens:', q.shape[3])
 print('See that the dimensions for queries, keys, and values are all the same:')
 print('\tShape of Q:', q.shape, '\n\tShape of K:', k.shape, '\n\tShape of V:', v.shape)

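You can see that the queries, keys, and values all have the same dimensions: 13 is the batch size, 1 is the number of attention heads, 100 is the number of input tokens (the sequence length), and 64 is the number of channels.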

 Dimensions for Queries are
    batchsize: 13 
    attention heads: 1 
    number of tokens: 100 
    new length of tokens: 64
 See that the dimensions for queries, keys, and values are all the same:
    Shape of Q: torch.Size([13, 1, 100, 64]) 
    Shape of K: torch.Size([13, 1, 100, 64]) 
    Shape of V: torch.Size([13, 1, 100, 64])

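Let's look at how attention is computed. It is defined as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the queries, keys, and values, and $d_k$ is the dimension of the keys, which equals the length of a key token and therefore equals head_dim.

The first step is to compute:

$$A = \frac{QK^T}{\sqrt{d_k}}$$

then:

$$A = \text{softmax}(A)$$

and finally:

$$x = AV$$

In the $QK^T$ matrix multiplication, $Q$ has shape (100, 64) and $K^T$ has shape (64, 100) for each batch element and head, so the product $A$ has shape (100, 100).

These are the main parts of attention. The code for the first step looks like this: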

 attn = (q * A.scale) @ k.transpose(-2, -1)
 print('Dimensions for Attn are\n\tbatchsize:', attn.shape[0], '\n\tattention heads:', attn.shape[1], '\n\tnumber of tokens:', attn.shape[2], '\n\tnumber of tokens:', attn.shape[3])

The result is:

 Dimensions for Attn are
    batchsize: 13 
    attention heads: 1 
    number of tokens: 100 
    number of tokens: 100

The next step is to compute the softmax of A, which does not change its shape.

 attn = attn.softmax(dim=-1)
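With `dim=-1`, the softmax is applied along the last axis, so each row of A (one query's scores against every key) becomes a set of non-negative weights that sum to 1. The following is a minimal pure-Python sketch of that row-wise operation on made-up scores; it is not how PyTorch implements it internally.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)                           # subtract the max to avoid overflow
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Two example rows of attention scores (illustrative values only).
scores = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
weights = [softmax(row) for row in scores]

for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 6))
```

A row of identical scores yields uniform weights, while larger scores receive proportionally more weight.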

Finally, we compute A·V = x:

 x = attn @ v
 print('Dimensions for x are\n\tbatchsize:', x.shape[0], '\n\tattention heads:', x.shape[1], '\n\tnumber of tokens:', x.shape[2], '\n\tlength of tokens:', x.shape[3])

This gives the result of the attention computation:

 Dimensions for x are
    batchsize: 13 
    attention heads: 1 
    number of tokens: 100 
    length of tokens: 64
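The three steps above (scale and multiply, softmax, multiply by V) can be collected into one small function. This is a sketch that mirrors the walkthrough with random stand-in tensors; the function name `dot_product_attention` is chosen here for illustration and is not part of the article's module.

```python
import torch

def dot_product_attention(q, k, v):
    """Return softmax(q @ k^T / sqrt(d_k)) @ v and the attention weights."""
    d_k = k.shape[-1]
    attn = (q @ k.transpose(-2, -1)) / d_k ** 0.5  # (..., tokens, tokens)
    attn = attn.softmax(dim=-1)                    # each row sums to 1
    return attn @ v, attn

torch.manual_seed(0)
q = torch.rand(13, 1, 100, 64)
k = torch.rand(13, 1, 100, 64)
v = torch.rand(13, 1, 100, 64)

x, attn = dot_product_attention(q, k, v)
print(x.shape)     # torch.Size([13, 1, 100, 64])
print(attn.shape)  # torch.Size([13, 1, 100, 100])

# With all-zero queries every score is 0, so the softmax is uniform and
# every output token is the plain average of the value tokens.
x_uniform, _ = dot_product_attention(torch.zeros_like(q), k, v)
v_mean = v.mean(dim=-2, keepdim=True).expand_as(x_uniform)
print(torch.allclose(x_uniform, v_mean, atol=1e-5))
```

The zero-query case is a quick sanity check: attention output is always a weighted average of the value tokens, with the weights given by the softmax rows.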

Since there is only one head, we drop the head dimension of size 1:

 x = x.transpose(1, 2).reshape(B, N, A.chan)

We then pass x through a learnable linear layer, which does not change its shape.

 x = A.proj(x)

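Finally, we implement the skip connection. Because the current shape of x differs from the shape of the original input x, V is used for the skip connection instead: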

 orig_shape = (batch, num_tokens, token_len)
 curr_shape = (x.shape[0], x.shape[1], x.shape[2])
 v = v.transpose(1, 2).reshape(B, N, A.chan)
 v_shape = (v.shape[0], v.shape[1], v.shape[2])
 print('Original shape of input x:', orig_shape)
 print('Current shape of x:', curr_shape)
 print('Shape of V:', v_shape)
 x = v + x
 print('After skip connection, dimensions for x are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\tlength of tokens:', x.shape[2])

The result is:

 Original shape of input x: (13, 100, 49)
 Current shape of x: (13, 100, 64)
 Shape of V: (13, 100, 64)
 After skip connection, dimensions for x are
    batchsize: 13 
    number of tokens: 100 
    length of tokens: 64
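The shapes printed above show why V stands in for the input in the skip connection. As a small illustration with random stand-in tensors, adding the original 49-length tokens to the 64-length output fails, while adding V works:

```python
import torch

x_in = torch.rand(13, 100, 49)   # shape of the original input
x_out = torch.rand(13, 100, 64)  # shape after attention and projection
v = torch.rand(13, 100, 64)      # shape of V after merging the heads

try:
    x_in + x_out                 # last dimensions 49 and 64 cannot broadcast
    mismatch = False
except RuntimeError:
    mismatch = True

print("input + output fails:", mismatch)
print("V + output shape:", (v + x_out).shape)
```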

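That is our single-head attention layer!

We can extend this to multi-head attention. In computer vision, this is usually called multi-headed self-attention (MSA). We will not go through every step in detail, and instead focus on the places where the matrix shapes differ.

For multi-head attention, the number of channels must be evenly divisible by the number of attention heads, so in this example we will use 4 attention heads.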
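As a quick check of the divisibility rule, the snippet below lists which head counts the module's assertion would accept for chan = 64, together with the resulting head_dim. The variable names are illustrative.

```python
chan = 64

# Head counts for which chan % num_heads == 0 holds.
valid_heads = [h for h in range(1, chan + 1) if chan % h == 0]
head_dims = {h: chan // h for h in valid_heads}

print(valid_heads)   # [1, 2, 4, 8, 16, 32, 64]
print(head_dims[4])  # 16
```

A value such as num_heads = 3 would trip the assertion in `Attention.__init__`, because 64 is not divisible by 3.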

 # Define an Input
 token_len = 7 * 7
 channels = 64
 num_tokens = 100
 batch = 13
 num_heads = 4
 x = torch.rand(batch, num_tokens, token_len)
 B, N, C = x.shape
 print('Input dimensions are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\ttoken size:', x.shape[2])
 
 # Define the Module
 MSA = Attention(dim=token_len, chan=channels, num_heads=num_heads, qkv_bias=False, qk_scale=None)
 MSA.eval()

The result is:

 Input dimensions are
    batchsize: 13 
    number of tokens: 100 
    token size: 49

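The process for computing the queries, keys, and values is the same as in the single-head case. However, the new token length is now chan/num_heads. The total size of the Q, K, and V matrices has not changed; their contents are simply distributed across the head dimension. You can think of it as splitting a single matrix into several smaller ones.

We denote the submatrix for query head i as $Q_{h_i}$.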

 qkv = MSA.qkv(x).reshape(B, N, 3, MSA.num_heads, MSA.head_dim).permute(2, 0, 3, 1, 4)
 q, k, v = qkv[0], qkv[1], qkv[2]
 print('Head Dimension = chan / num_heads =', MSA.chan, '/', MSA.num_heads, '=', MSA.head_dim)
 print('Dimensions for Queries are\n\tbatchsize:', q.shape[0], '\n\tattention heads:', q.shape[1], '\n\tnumber of tokens:', q.shape[2], '\n\tnew length of tokens:', q.shape[3])
 print('See that the dimensions for queries, keys, and values are all the same:')
 print('\tShape of Q:', q.shape, '\n\tShape of K:', k.shape, '\n\tShape of V:', v.shape)

The output is:

 Head Dimension = chan / num_heads = 64 / 4 = 16
 Dimensions for Queries are
    batchsize: 13 
    attention heads: 4 
    number of tokens: 100 
    new length of tokens: 16
 See that the dimensions for queries, keys, and values are all the same:
    Shape of Q: torch.Size([13, 4, 100, 16]) 
    Shape of K: torch.Size([13, 4, 100, 16]) 
    Shape of V: torch.Size([13, 4, 100, 16])
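The claim that the contents are only redistributed across the head dimension can be checked directly. In this sketch a random tensor stands in for the single-head queries; each head turns out to be a contiguous slice of 16 channels.

```python
import torch

torch.manual_seed(0)
B, N, chan, num_heads = 13, 100, 64, 4
head_dim = chan // num_heads

q_full = torch.rand(B, N, chan)  # stand-in for single-head queries

# Same reshape/permute pattern as in the module, without the leading 3.
q_heads = q_full.reshape(B, N, num_heads, head_dim).permute(0, 2, 1, 3)
print(q_heads.shape)  # torch.Size([13, 4, 100, 16])

# Head i holds channels [i * head_dim, (i + 1) * head_dim) of every token.
all_match = all(
    torch.equal(q_heads[:, i], q_full[:, :, i * head_dim:(i + 1) * head_dim])
    for i in range(num_heads)
)
print(all_match)  # True
```

No values are added or dropped: `q_full.numel()` and `q_heads.numel()` are identical.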

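Note that the key length is now the per-head dimension:

$$d_k = \frac{\text{chan}}{\text{num\_heads}}$$

so the scale factor is computed from the channel count divided by the number of heads. Because Q and K are split across heads, we get num_heads = 4 different Attn matrices: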

 attn = (q * MSA.scale) @ k.transpose(-2, -1)
 print('Dimensions for Attn are\n\tbatchsize:', attn.shape[0], '\n\tattention heads:', attn.shape[1], '\n\tnumber of tokens:', attn.shape[2], '\n\tnumber of tokens:', attn.shape[3])

The dimensions are:

 Dimensions for Attn are
    batchsize: 13 
    attention heads: 4 
    number of tokens: 100 
    number of tokens: 100

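The softmax does not change the dimensions, so we skip over it and compute, for each head i:

$$x_{h_i} = A_{h_i} V_{h_i}$$

Across multiple attention heads, this looks like: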

 attn = attn.softmax(dim=-1)
 
 x = attn @ v
 print('Dimensions for x are\n\tbatchsize:', x.shape[0], '\n\tattention heads:', x.shape[1], '\n\tnumber of tokens:', x.shape[2], '\n\tlength of tokens:', x.shape[3])

The dimensions are:

 Dimensions for x are
    batchsize: 13 
    attention heads: 4 
    number of tokens: 100 
    length of tokens: 16

Finally, we need to reshape and concatenate all of the $x_{h_i}$ together. This is the inverse of the first step:

 x = x.transpose(1, 2).reshape(B, N, MSA.chan)
 print('Dimensions for x are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\tlength of tokens:', x.shape[2])

The result is:

 Dimensions for x are
    batchsize: 13 
    number of tokens: 100 
    length of tokens: 64
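The transpose-then-reshape above is equivalent to concatenating the per-head outputs along the channel axis. A short sketch with a random stand-in for the per-head outputs:

```python
import torch

torch.manual_seed(0)
B, N, num_heads, head_dim = 13, 100, 4, 16
chan = num_heads * head_dim

x_heads = torch.rand(B, num_heads, N, head_dim)  # stand-in for attn @ v

# The merge used in the module: (B, heads, N, head_dim) -> (B, N, chan)
merged = x_heads.transpose(1, 2).reshape(B, N, chan)

# Explicit concatenation of each head's output along the last axis.
concat = torch.cat([x_heads[:, i] for i in range(num_heads)], dim=-1)

print(merged.shape)                 # torch.Size([13, 100, 64])
print(torch.equal(merged, concat))  # True
```

The reshape form avoids building a Python list of per-head tensors, which is why it is the usual way to write this step.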

Now that the outputs of all heads have been concatenated, the rest of the attention module is unchanged.

 x = MSA.proj(x)
 print('Dimensions for x are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\tlength of tokens:', x.shape[2])
 
 orig_shape = (batch, num_tokens, token_len)
 curr_shape = (x.shape[0], x.shape[1], x.shape[2])
 v = v.transpose(1, 2).reshape(B, N, MSA.chan)
 v_shape = (v.shape[0], v.shape[1], v.shape[2])
 print('Original shape of input x:', orig_shape)
 print('Current shape of x:', curr_shape)
 print('Shape of V:', v_shape)
 x = v + x
 print('After skip connection, dimensions for x are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\tlength of tokens:', x.shape[2])

The result is:

 Dimensions for x are
    batchsize: 13 
    number of tokens: 100 
    length of tokens: 64
 Original shape of input x: (13, 100, 49)
 Current shape of x: (13, 100, 64)
 Shape of V: (13, 100, 64)
 After skip connection, dimensions for x are
    batchsize: 13 
    number of tokens: 100 
    length of tokens: 64

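That is multi-head attention!

In this article we implemented the attention layer used in ViT. We wrote the code by hand in order to explain each step in detail. For practical use, you can use torch.nn.MultiheadAttention in PyTorch, since its implementation is much faster.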
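A minimal usage sketch of `torch.nn.MultiheadAttention` for self-attention follows. It is not a drop-in replacement for the module above: it assumes the tokens already have length 64 (it does not perform the 49 to 64 projection), and it does not include the skip connection through V.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
mha.eval()

x = torch.rand(13, 100, 64)  # (batch, num_tokens, token_len), already 64 long
with torch.no_grad():
    # Self-attention: the same tensor serves as query, key, and value.
    out, weights = mha(x, x, x)

print(out.shape)      # torch.Size([13, 100, 64])
print(weights.shape)  # torch.Size([13, 100, 100])
```

By default the returned attention weights are averaged over the heads, which is why `weights` has no head dimension here.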

References:

[1] Vaswani et al. (2017). Attention Is All You Need. https://doi.org/10.48550/arXiv.1706.03762

[2] Dosovitskiy et al. (2020). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. https://doi.org/10.48550/arXiv.2010.11929

[3] Yuan et al. (2021). Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. https://doi.org/10.48550/arXiv.2101.11986 GitHub code: https://github.com/yitu-opensource/T2T-ViT

https://avoid.overfit.cn/post/0d526cd56c8842c599b4fe1c9adcfd9f

Author: Skylar Jean Callis
