RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?
Original note: https://www.yuque.com/lart/pa…
Understanding the Paper from the Abstract
For the past ten years, CNN has reigned supreme in the world of computer vision, but recently, Transformer is on the rise. However, the quadratic computational cost of self-attention has become a severe problem of practice.
This points out the high computational cost introduced by the self-attention structure.
There has been much research on architectures without CNN and self-attention in this context. In particular, MLP-Mixer is a simple idea designed using MLPs and hit an accuracy comparable to the Vision Transformer.
This leads into the core topic of this paper: MLP-based architectures.
However, the only inductive bias in this architecture is the embedding of tokens.
In the MLP architecture, the only place where an inductive bias is introduced is the token embedding step.
In my view, inductive bias is brought up here mainly because the goal is to introduce more inductive bias into the original pure-MLP architecture in order to achieve better training results on vision tasks. Presumably this paper will again borrow ideas from convolutional architectures.
Thus, there is still a possibility to build a non-convolutional inductive bias into the architecture itself, and we built in an inductive bias using two simple ideas.
The emphasis here is that although an inductive bias is introduced, it is not introduced through convolutional structures; the only way left is to constrain the computation itself.
- A way is to divide the token-mixing block vertically and horizontally.
- Another way is to make spatial correlations denser among some channels of token-mixing.
Here, once again, we see the idea of splitting the computation along the vertical and horizontal directions. Similar ideas have already appeared in many methods, for example:
- Convolution-based methods
  - Dimension decomposition
    - https://www.yuque.com/lart/architecture/conv#DT9xE
- Axial-attention Transformer methods
  - Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
    - https://www.yuque.com/lart/architecture/nmxfgf#14a5N
  - CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
    - https://www.yuque.com/lart/architecture/nmxfgf#vxw8d
- MLP methods
  - Hire-MLP: Vision MLP via Hierarchical Rearrangement
    - https://www.yuque.com/lart/papers/lbhadn
The second idea is not very intuitive for now; it looks like a modification of the channel-mixing MLP?
With this approach, we were able to improve the accuracy of the MLP-Mixer while _reducing its parameters and computational complexity_.
After all, the divide-and-conquer strategy turns the fully connected computation that originally mixed all tokens at once into cascaded processing along specific axes.
Roughly speaking, this reduces the amount of computation from about $O(2(HW)^2)$ to $O(H^2) + O(W^2)$.
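A quick back-of-the-envelope check (my own illustration, not from the paper) of how much smaller the per-layer weight count becomes when one dense mixing over all $H \times W$ tokens is replaced by one mixing per axis; the raft size $r$ and the expansion factor $e$ are ignored here, and the patch-grid size is arbitrary:
```python
# Rough weight-count comparison for a single mixing layer (bias terms,
# raft size r, and expansion factor e are all ignored in this sketch).
H = W = 14  # e.g. a 224x224 image split into 16x16 patches

dense_weights = (H * W) ** 2         # one Linear over all H*W tokens: 38416
separable_weights = H ** 2 + W ** 2  # one Linear per axis:            392

print(dense_weights, separable_weights, dense_weights / separable_weights)
```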
Compared to other MLP-based models, the proposed model, named RaftMLP has a good balance of computational complexity, the number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias. The source code in PyTorch version is available at https://github.com/okojoalg/raft-mlp.
Main Content
As can be seen, in essence this can still be regarded as an adjustment of the spatial (token-mixing) MLP.
Here, the original structure of alternately stacked spatial and channel MLPs is replaced with a cascade of three blocks: vertical, horizontal, and channel. In this way, the authors hope to introduce inductive biases that are meaningful for 2D images along the vertical and horizontal directions, implicitly assuming that a sequence of horizontally or vertically aligned patches is correlated in a way similar to other horizontally or vertically aligned patch sequences. In addition, before being fed into the vertical-mixing and horizontal-mixing blocks, some channels are concatenated and shared by these two modules. This is done because the authors assume that geometric relations exist among certain channels (the channels grouped in this way are later called a Channel Raft, and the assumption is that channels at a specific interval $r$ have such a relation).
Index-pattern transformation in the Vertical-Mixing Block: $(r_h r_w s_r,\ h,\ w) \rightarrow (s_r,\ r_h h,\ r_w w) \Leftrightarrow (r_w s_r w,\ r_h h)$ (the two forms are equivalent because the channel and horizontal dimensions are shared here; the figure draws the form on the left of the equivalence sign). The Horizontal-Mixing Block is analogous.
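The rearrangement described above can be verified with a small einops shape trace. This is a minimal sketch using the notation of the code below ($r$ groups of $o = c/r$ channels); the concrete sizes are arbitrary and chosen only for readability:
```python
import torch
from einops import rearrange

# Toy sizes: 4x4 patch grid, c = 16 channels, raft size r = 2, so o = c // r = 8.
b, h, w, r, o = 1, 4, 4, 2, 8
x = torch.randn(b, h * w, r * o)

# Vertical-Mixing rearrangement: the raft index r and the height h are gathered
# onto the last axis, so a single Linear(r * h, ...) mixes vertically aligned
# tokens jointly across the r grouped channels.
y = rearrange(x, 'b (h w) (r o) -> b (o w) (r h)', h=h, w=w, r=r, o=o)
print(y.shape)  # torch.Size([1, 32, 8]) == (b, o*w, r*h)
```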
For the Raft-Token-Mixing Block composed of the horizontal and vertical modules, the code example given by the authors matches the form on the right-hand side of the equivalence sign in the expression above. From the code we can see that the normalization is not affected by the channel grouping; it is applied directly to the channels of the features in their original layout.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange


class RaftTokenMixingBlock(nn.Module):
    # b: size of mini-batch, h: height, w: width,
    # c: channel, r: size of raft (number of groups), o: c//r,
    # e: expansion factor,
    # x: input tensor of shape (b, h*w, c)
    def __init__(self, h, w, c, r, e):
        super().__init__()
        assert c % r == 0, "c must be divisible by r"
        self.h, self.w, self.r, self.o = h, w, r, c // r
        self.lnv = nn.LayerNorm(c)
        self.lnh = nn.LayerNorm(c)
        # vertical mixing acts on an axis of length r * h
        self.fcv1 = nn.Linear(r * h, r * h * e)
        self.fcv2 = nn.Linear(r * h * e, r * h)
        # horizontal mixing acts on an axis of length r * w
        self.fch1 = nn.Linear(r * w, r * w * e)
        self.fch2 = nn.Linear(r * w * e, r * w)

    def forward(self, x):
        """x: (b, h*w, c)"""
        h, w, r, o = self.h, self.w, self.r, self.o
        # Vertical-Mixing Block: mix along (r, h) for every (o, w) slice
        y = self.lnv(x)
        y = rearrange(y, 'b (h w) (r o) -> b (o w) (r h)', h=h, w=w, r=r, o=o)
        y = self.fcv1(y)
        y = F.gelu(y)
        y = self.fcv2(y)
        y = rearrange(y, 'b (o w) (r h) -> b (h w) (r o)', h=h, w=w, r=r, o=o)
        x = x + y  # residual connection of the vertical block
        # Horizontal-Mixing Block: mix along (r, w) for every (o, h) slice
        y = self.lnh(x)
        y = rearrange(y, 'b (h w) (r o) -> b (o h) (r w)', h=h, w=w, r=r, o=o)
        y = self.fch1(y)
        y = F.gelu(y)
        y = self.fch2(y)
        y = rearrange(y, 'b (o h) (r w) -> b (h w) (r o)', h=h, w=w, r=r, o=o)
        return x + y  # residual connection of the horizontal block
```
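A quick smoke test of the block as repaired above (the constructor signature `(h, w, c, r, e)` is my own choice for this sketch, not fixed by the paper):
```python
# Arbitrary sizes: 14x14 patch grid, 64 channels, raft size 2, expansion 2.
block = RaftTokenMixingBlock(h=14, w=14, c=64, r=2, e=2)
x = torch.randn(8, 14 * 14, 64)  # (b, h*w, c)
print(block(x).shape)            # torch.Size([8, 196, 64])
```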
For the proposed structure, with a suitable choice of $r$, the raft-token-mixing block has fewer parameters than the original token-mixing block (when $r < h'/\sqrt{2}$) and fewer MACs (multiply-accumulate operations) (when $r < h'/2^{\frac{1}{4}}$). This assumes $h' = w'$ and that the token-mixing block uses the same expansion factor $e$.
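The threshold $r < h'/\sqrt{2}$ can be reconstructed with a simple parameter count (my own sketch: biases ignored, the same expansion factor $e$ in both blocks, and $h' = w'$):
```latex
% token-mixing:      two Linear layers over h'w' tokens -> 2 e (h'w')^2 = 2 e h'^4
% raft-token-mixing: 2 e (r h')^2 + 2 e (r w')^2        = 4 e r^2 h'^2
\[
  4 e r^{2} h'^{2} < 2 e h'^{4}
  \;\Longleftrightarrow\;
  r^{2} < \frac{h'^{2}}{2}
  \;\Longleftrightarrow\;
  r < \frac{h'}{\sqrt{2}}.
\]
```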
Experimental Results
Here, because of how the models are configured, RaftMLP-12 is mainly compared with Mixer-B/16 and ViT-B/16, while RaftMLP-36 is mainly compared with ResMLP-36.
Although RaftMLP-36 has almost the same parameters and number of FLOPs as ResMLP-36, it is not more accurate than ResMLP-36. However, since RaftMLP and ResMLP have different detailed architectures other than the raft-token-mixing block, the effect of the raft-token-mixing block cannot be directly compared, unlike the comparison with MLP-Mixer. Nevertheless, we can see that raft-token-mixing is working even though the layers are deeper than RaftMLP-12. (As for this last comparison with the 36-layer model, I am not sure what it is trying to say; with more layers, might raft-token-mixing stop working?)
Some Extensions and Outlook
- The token-mixing block can be extended to the 3D setting to replace 3D convolutions, which could be used for processing video.
- This paper only introduces horizontal and vertical spatial inductive biases, plus a constraint on the correlations among some channels. However, the authors also mention that other inductive biases could still be tried, such as parallel invariance (which I do not fully understand) and hierarchy.
Links
- Paper: https://arxiv.org/abs/2108.04384
- Code: https://github.com/okojoalg/raft-mlp