
RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?

Original document: https://www.yuque.com/lart/pa…

Understanding the Paper from the Abstract

For the past ten years, CNN has reigned supreme in the world of computer vision, but recently, Transformer is on the rise. However, the quadratic computational cost of self-attention has become a severe problem of practice.

This points out the high computational cost of the self-attention structure.

There has been much research on architectures without CNN and self-attention in this context. In particular, MLP-Mixer is a simple idea designed using MLPs and hit an accuracy comparable to the Vision Transformer.

This leads into the core of the paper: the MLP architecture.

However, the only inductive bias in this architecture is the embedding of tokens.

In the MLP architecture, the only place an inductive bias is introduced is the token embedding step.
In my view, inductive bias is brought up here mainly to motivate adding more inductive bias to the original pure-MLP architecture, so that it trains better on vision tasks. Presumably this paper will once again borrow ideas from convolutional architectures.

Thus, there is still a possibility to build a non-convolutional inductive bias into the architecture itself, and we built in an inductive bias using two simple ideas.

The emphasis here is that although an inductive bias is introduced, it is not introduced through convolutional structures. That leaves constraining the computation itself as the only option.

  1. A way is to divide the token-mixing block vertically and horizontally.
  2. Another way is to make spatial correlations denser among some channels of token-mixing.

    Here, once again, we see the idea of splitting the computation along the vertical and horizontal directions. Similar thinking has already appeared in many methods, for example:

    • Convolution-based methods

      • Dimension decomposition

        • https://www.yuque.com/lart/architecture/conv#DT9xE
    • Axial-attention Transformer methods

      • Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation

        • https://www.yuque.com/lart/architecture/nmxfgf#14a5N
      • CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

        • https://www.yuque.com/lart/architecture/nmxfgf#vxw8d
    • MLP-based methods

      • Hire-MLP: Vision MLP via Hierarchical Rearrangement

        • https://www.yuque.com/lart/papers/lbhadn

The second point is not very intuitive at this stage; it looks like a modification of the channel MLP?

With this approach, we were able to improve the accuracy of the MLP-Mixer while _reducing its parameters and computational complexity_.

After all, the divide-and-conquer strategy turns a fully connected computation over all tokens at once into cascaded processing along specific axes.
Roughly speaking, this reduces the computation from about $O(2(HW)^2)$ to $O(H^2) + O(W^2)$.
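
To spell out where this estimate comes from (my own rough counting, not from the paper): the factor $2$ in $O(2(HW)^2)$ corresponds to the two fully connected layers of a mixing MLP, and the split replaces one mixing over all $HW$ tokens with two mixings over sequences whose length is on the order of $H$ and $W$ respectively:

$$O\big(2(HW)^2\big) \;\longrightarrow\; O\big(H^2\big) + O\big(W^2\big),$$

ignoring the expansion factor $e$ and the raft size $r$; the more careful comparison appears later via the conditions on $r$.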

Compared to other MLP-based models, the proposed model, named RaftMLP has a good balance of computational complexity, the number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias. The source code in PyTorch version is available at https://github.com/okojoalg/raft-mlp.

Main Content

As can be seen, this can in fact still be regarded as an adjustment of the spatial (token-mixing) MLP.

Here, the original structure of alternately stacked spatial and channel MLPs is modified into a cascade of three parts: vertical, horizontal, and channel mixing. In this way, the authors hope to build in an inductive bias that is meaningful for 2D images along the vertical and horizontal directions, implicitly assuming that a horizontally or vertically aligned sequence of patches is correlated in a similar way to other horizontally or vertically aligned patch sequences. In addition, before being fed into the vertical-mixing and horizontal-mixing blocks, some channels are concatenated and shared by these two modules. This is done because the authors assume that geometric relationships exist between certain channels (the channels grouped in this way are later called the Channel Raft, and the assumption is that channels at a specific interval $r$ have such a relationship).

The index pattern of the Vertical-Mixing Block changes as (rh*rw*sr, h, w) -> (sr, rh*h, rw*w) <=> (rw*sr*w, rh*h). The two forms are equivalent because the channel and horizontal axes are shared here; the figure draws the form on the left of the equivalence sign. The Horizontal-Mixing Block is analogous.
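
A minimal shape check of this vertical-mixing rearrangement (my own toy sizes, using the same names b, h, w, r, o as the code below):

import torch
from einops import rearrange

b, h, w, r, o = 2, 4, 4, 2, 8           # channels c = r * o = 16
x = torch.randn(b, h * w, r * o)         # (b, h*w, c), as in the block below

# Fold the raft channels together with the height axis, so one Linear layer
# can mix over the (r*h) dimension.
y = rearrange(x, 'b (h w) (r o) -> b (o w) (r h)', h=h, w=w, r=r)
print(y.shape)  # torch.Size([2, 32, 8]) == (b, o*w, r*h)

# The inverse rearrangement restores the original token/channel layout.
z = rearrange(y, 'b (o w) (r h) -> b (h w) (r o)', h=h, w=w, r=r)
print(z.shape)  # torch.Size([2, 16, 16]) == (b, h*w, c)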

For the Raft-Token-Mixing Block formed by the horizontal and vertical modules, the code example given by the authors matches the form on the right-hand side of the equivalence sign in my expression above. From the code we can see that the normalization is not affected by the channel grouping; it operates directly on the channels of the features in their original layout. (In the snippet below, the sizes h, w, c, r, e are passed to the constructor explicitly and the einops patterns are given their axis sizes so that it runs as written; the paper's listing leaves these implicit.)

import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange


class RaftTokenMixingBlock(nn.Module):
    # b: size of mini-batch, h: height, w: width,
    # c: channels, r: size of raft (number of groups), o: c // r,
    # e: expansion factor,
    # x: input tensor of shape (b, h * w, c)
    # NOTE: the explicit constructor arguments and einops axis sizes are added
    # here so the snippet is self-contained; the paper's listing leaves them implicit.
    def __init__(self, h, w, c, r, e):
        super().__init__()
        self.h, self.w, self.r = h, w, r
        self.lnv = nn.LayerNorm(c)
        self.lnh = nn.LayerNorm(c)
        self.fcv1 = nn.Linear(r * h, r * h * e)
        self.fcv2 = nn.Linear(r * h * e, r * h)
        self.fch1 = nn.Linear(r * w, r * w * e)
        self.fch2 = nn.Linear(r * w * e, r * w)

    def forward(self, x):
        """x: (b, h * w, c)"""
        h, w, r = self.h, self.w, self.r

        # Vertical-Mixing Block: mix over the (r * h) axis.
        y = self.lnv(x)
        y = rearrange(y, 'b (h w) (r o) -> b (o w) (r h)', h=h, w=w, r=r)
        y = self.fcv1(y)
        y = F.gelu(y)
        y = self.fcv2(y)
        y = rearrange(y, 'b (o w) (r h) -> b (h w) (r o)', h=h, w=w, r=r)
        y = x + y

        # Horizontal-Mixing Block: mix over the (r * w) axis.
        y = self.lnh(y)
        y = rearrange(y, 'b (h w) (r o) -> b (o h) (r w)', h=h, w=w, r=r)
        y = self.fch1(y)
        y = F.gelu(y)
        y = self.fch2(y)
        y = rearrange(y, 'b (o h) (r w) -> b (h w) (r o)', h=h, w=w, r=r)
        # Residual connection back to the block's input, as in the original listing.
        return x + y
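
For completeness, a minimal usage sketch (the constructor arguments and the toy sizes are my additions, not from the paper):

import torch

h, w, c, r, e = 14, 14, 64, 2, 2          # c must be divisible by r
block = RaftTokenMixingBlock(h=h, w=w, c=c, r=r, e=e)
x = torch.randn(8, h * w, c)               # (b, h*w, c)
out = block(x)
print(out.shape)                           # torch.Size([8, 196, 64])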

For the proposed structure, by choosing a suitable $r$, the resulting raft-token-mixing block can have fewer parameters than the original token-mixing block (when $r < h'/\sqrt{2}$) and fewer MACs (multiply-accumulate operations) (when $r < h'/2^{\frac{1}{4}}$). This assumes $h' = w'$ and that the token-mixing block also uses the expansion factor $e$.
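
As a rough sanity check of the parameter condition (my own counting under the stated assumptions $h'=w'$ and expansion factor $e$ in both blocks, biases ignored):

$$\text{token-mixing: } 2e(h'w')^2 = 2e\,h'^4, \qquad \text{raft-token-mixing: } 2e(rh')^2 + 2e(rw')^2 = 4e\,r^2h'^2,$$

$$4e\,r^2h'^2 < 2e\,h'^4 \;\Longleftrightarrow\; r < h'/\sqrt{2},$$

which matches the condition quoted above.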

Experimental Results

Here, due to how the models are configured, RaftMLP-12 is mainly compared against Mixer-B/16 and ViT-B/16, while RaftMLP-36 is mainly compared against ResMLP-36.

Although RaftMLP-36 has almost the same parameters and number of FLOPs as ResMLP-36, it is not more accurate than ResMLP-36. However, since RaftMLP and ResMLP have different detailed architectures other than the raft-token-mixing block, the effect of the raft-token-mixing block cannot be directly compared, unlike the comparison with MLP-Mixer. Nevertheless, we can see that raft-token-mixing is working even though the layers are deeper than RaftMLP-12. (As for this last comparison with the 36-layer model, I am not sure what point is being made. Is the implication that with more layers raft-token-mixing might stop working?)

Extensions and Outlook

  • The token-mixing block can be extended to the 3D case to replace 3D convolutions, which could be used to process video (see the sketch after this list).
  • This paper only introduces horizontal and vertical spatial inductive biases, plus a constraint on the correlations among some channels. However, the authors also mention that other inductive biases could be explored, such as parallel invariance (this one is not entirely clear to me) and hierarchy.
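
Purely as an illustration of the first point, one conceivable way to fold a temporal axis into the raft rearrangement (my own sketch; the paper gives no code for this):

import torch
from einops import rearrange

b, t, h, w, r, o = 2, 8, 4, 4, 2, 8
x = torch.randn(b, t * h * w, r * o)   # (batch, t*h*w tokens, channels)

# Temporal-mixing view: fold the raft channels with the time axis, so a Linear
# layer could mix over (r*t), analogous to the vertical/horizontal blocks.
y = rearrange(x, 'b (t h w) (r o) -> b (o h w) (r t)', t=t, h=h, w=w, r=r)
print(y.shape)  # torch.Size([2, 128, 16]) == (b, o*h*w, r*t)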

Links

  • Paper: https://arxiv.org/abs/2108.04384
  • Code: https://github.com/okojoalg/raft-mlp