RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?
Original note: https://www.yuque.com/lart/pa…
Understanding the Paper from the Abstract
For the past ten years, CNN has reigned supreme in the world of computer vision, but recently, Transformer is on the rise. However, the quadratic computational cost of self-attention has become a severe problem of practice.
This points out the high computational cost introduced by the self-attention structure.
There has been much research on architectures without CNN and self-attention in this context. In particular, MLP-Mixer is a simple idea designed using MLPs and hit an accuracy comparable to the Vision Transformer.
This leads into the core topic of this paper: MLP-based architectures.
However, the only inductive bias in this architecture is the embedding of tokens.
In the MLP architecture, the only place where an inductive bias is introduced is the token embedding step.
In my view, inductive bias is brought up here mainly because the goal is to introduce more inductive bias into the original pure-MLP architecture in order to achieve better training results on vision tasks. Presumably this paper will again borrow ideas from convolutional architectures.
Thus, there is still a possibility to build a non-convolutional inductive bias into the architecture itself, and we built in an inductive bias using two simple ideas.
The emphasis here is that although an inductive bias is introduced, it is not introduced through convolutional structures; the only way left is to constrain the computation itself.
- A way is to divide the token-mixing block vertically and horizontally.
- Another way is to make spatial correlations denser among some channels of token-mixing.
Here, once again, we see the idea of splitting the computation along the vertical and horizontal directions. Similar ideas have already appeared in many methods, for example:
- Convolution-based methods
  - Dimension decomposition
    - https://www.yuque.com/lart/architecture/conv#DT9xE
- Axial-attention Transformer methods
  - Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
    - https://www.yuque.com/lart/architecture/nmxfgf#14a5N
  - CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
    - https://www.yuque.com/lart/architecture/nmxfgf#vxw8d
- MLP methods
  - Hire-MLP: Vision MLP via Hierarchical Rearrangement
    - https://www.yuque.com/lart/papers/lbhadn
The second idea is not very intuitive for now; it looks like a modification of the channel-mixing MLP?
With this approach, we were able to improve the accuracy of the MLP-Mixer while _reducing its parameters and computational complexity_.
After all, the divide-and-conquer strategy turns the fully connected computation that originally mixed all tokens at once into cascaded processing along specific axes.
Roughly speaking, this reduces the amount of computation from about $O(2(HW)^2)$ to $O(H^2) + O(W^2)$.
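A quick back-of-the-envelope check (my own illustration, not from the paper) of how much smaller the per-layer weight count becomes when one dense mixing over all $H \times W$ tokens is replaced by one mixing per axis; the raft size $r$ and the expansion factor $e$ are ignored here, and the patch-grid size is arbitrary:
```python
# Rough weight-count comparison for a single mixing layer (bias terms,
# raft size r, and expansion factor e are all ignored in this sketch).
H = W = 14  # e.g. a 224x224 image split into 16x16 patches

dense_weights = (H * W) ** 2         # one Linear over all H*W tokens: 38416
separable_weights = H ** 2 + W ** 2  # one Linear per axis:            392

print(dense_weights, separable_weights, dense_weights / separable_weights)
```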
Compared to other MLP-based models, the proposed model, named RaftMLP has a good balance of computational complexity, the number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias. The source code in PyTorch version is available at https://github.com/okojoalg/raft-mlp.
Main Content
As can be seen, in essence this can still be regarded as an adjustment of the spatial (token-mixing) MLP.
Here, the original structure of alternately stacked spatial and channel MLPs is replaced with a cascade of three blocks: vertical, horizontal, and channel. In this way, the authors hope to introduce inductive biases that are meaningful for 2D images along the vertical and horizontal directions, implicitly assuming that a sequence of horizontally or vertically aligned patches is correlated in a way similar to other horizontally or vertically aligned patch sequences. In addition, before being fed into the vertical-mixing and horizontal-mixing blocks, some channels are concatenated and shared by these two modules. This is done because the authors assume that geometric relations exist among certain channels (the channels grouped in this way are later called a Channel Raft, and the assumption is that channels at a specific interval $r$ have such a relation).
Index-pattern transformation in the Vertical-Mixing Block: $(r_h r_w s_r,\ h,\ w) \rightarrow (s_r,\ r_h h,\ r_w w) \Leftrightarrow (r_w s_r w,\ r_h h)$ (the two forms are equivalent because the channel and horizontal dimensions are shared here; the figure draws the form on the left of the equivalence sign). The Horizontal-Mixing Block is analogous.
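The rearrangement described above can be verified with a small einops shape trace. This is a minimal sketch using the notation of the code below ($r$ groups of $o = c/r$ channels); the concrete sizes are arbitrary and chosen only for readability:
```python
import torch
from einops import rearrange

# Toy sizes: 4x4 patch grid, c = 16 channels, raft size r = 2, so o = c // r = 8.
b, h, w, r, o = 1, 4, 4, 2, 8
x = torch.randn(b, h * w, r * o)

# Vertical-Mixing rearrangement: the raft index r and the height h are gathered
# onto the last axis, so a single Linear(r * h, ...) mixes vertically aligned
# tokens jointly across the r grouped channels.
y = rearrange(x, 'b (h w) (r o) -> b (o w) (r h)', h=h, w=w, r=r, o=o)
print(y.shape)  # torch.Size([1, 32, 8]) == (b, o*w, r*h)
```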
For the Raft-Token-Mixing Block composed of the horizontal and vertical modules, the code example given by the authors matches the form on the right-hand side of the equivalence sign in the expression above. From the code we can see that the normalization is not affected by the channel grouping; it is applied directly to the channels of the features in their original layout.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange


class RaftTokenMixingBlock(nn.Module):
    # b: size of mini-batch, h: height, w: width,
    # c: channel, r: size of raft (number of groups), o: c//r,
    # e: expansion factor,
    # x: input tensor of shape (b, h*w, c)
    def __init__(self, h, w, c, r, e):
        super().__init__()
        assert c % r == 0, "c must be divisible by r"
        self.h, self.w, self.r, self.o = h, w, r, c // r
        self.lnv = nn.LayerNorm(c)
        self.lnh = nn.LayerNorm(c)
        # vertical mixing acts on an axis of length r * h
        self.fcv1 = nn.Linear(r * h, r * h * e)
        self.fcv2 = nn.Linear(r * h * e, r * h)
        # horizontal mixing acts on an axis of length r * w
        self.fch1 = nn.Linear(r * w, r * w * e)
        self.fch2 = nn.Linear(r * w * e, r * w)

    def forward(self, x):
        """x: (b, h*w, c)"""
        h, w, r, o = self.h, self.w, self.r, self.o
        # Vertical-Mixing Block: mix along (r, h) for every (o, w) slice
        y = self.lnv(x)
        y = rearrange(y, 'b (h w) (r o) -> b (o w) (r h)', h=h, w=w, r=r, o=o)
        y = self.fcv1(y)
        y = F.gelu(y)
        y = self.fcv2(y)
        y = rearrange(y, 'b (o w) (r h) -> b (h w) (r o)', h=h, w=w, r=r, o=o)
        x = x + y  # residual connection of the vertical block
        # Horizontal-Mixing Block: mix along (r, w) for every (o, h) slice
        y = self.lnh(x)
        y = rearrange(y, 'b (h w) (r o) -> b (o h) (r w)', h=h, w=w, r=r, o=o)
        y = self.fch1(y)
        y = F.gelu(y)
        y = self.fch2(y)
        y = rearrange(y, 'b (o h) (r w) -> b (h w) (r o)', h=h, w=w, r=r, o=o)
        return x + y  # residual connection of the horizontal block
```
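A quick smoke test of the block as repaired above (the constructor signature `(h, w, c, r, e)` is my own choice for this sketch, not fixed by the paper):
```python
# Arbitrary sizes: 14x14 patch grid, 64 channels, raft size 2, expansion 2.
block = RaftTokenMixingBlock(h=14, w=14, c=64, r=2, e=2)
x = torch.randn(8, 14 * 14, 64)  # (b, h*w, c)
print(block(x).shape)            # torch.Size([8, 196, 64])
```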
For the proposed structure, with a suitable choice of $r$, the raft-token-mixing block has fewer parameters than the original token-mixing block (when $r < h'/\sqrt{2}$) and fewer MACs (multiply-accumulate operations) (when $r < h'/2^{\frac{1}{4}}$). This assumes $h' = w'$ and that the token-mixing block uses the same expansion factor $e$.
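The threshold $r < h'/\sqrt{2}$ can be reconstructed with a simple parameter count (my own sketch: biases ignored, the same expansion factor $e$ in both blocks, and $h' = w'$):
```latex
% token-mixing:      two Linear layers over h'w' tokens -> 2 e (h'w')^2 = 2 e h'^4
% raft-token-mixing: 2 e (r h')^2 + 2 e (r w')^2        = 4 e r^2 h'^2
\[
  4 e r^{2} h'^{2} < 2 e h'^{4}
  \;\Longleftrightarrow\;
  r^{2} < \frac{h'^{2}}{2}
  \;\Longleftrightarrow\;
  r < \frac{h'}{\sqrt{2}}.
\]
```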
Experimental Results
Here, because of how the models are configured, RaftMLP-12 is mainly compared with Mixer-B/16 and ViT-B/16, while RaftMLP-36 is mainly compared with ResMLP-36.
Although RaftMLP-36 has almost the same parameters and number of FLOPs as ResMLP-36, it is not more accurate than ResMLP-36. However, since RaftMLP and ResMLP have different detailed architectures other than the raft-token-mixing block, the effect of the raft-token-mixing block cannot be directly compared, unlike the comparison with MLP-Mixer. Nevertheless, we can see that raft-token-mixing is working even though the layers are deeper than RaftMLP-12. (As for this last comparison with the 36-layer model, I am not sure what it is trying to say; with more layers, might raft-token-mixing stop working?)
Some Extensions and Outlook
- The token-mixing block can be extended to the 3D setting to replace 3D convolutions, which could be used for processing video.
- This paper only introduces horizontal and vertical spatial inductive biases, plus a constraint on the correlations among some channels. However, the authors also mention that other inductive biases could still be tried, such as parallel invariance (which I do not fully understand) and hierarchy.
Links
- Paper: https://arxiv.org/abs/2108.04384
- Code: https://github.com/okojoalg/raft-mlp