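RepVGG: A Detailed Walkthrough of the Paper and a PyTorch Reimplementation

RepVGG: Making VGG-style ConvNets Great Again is a CVPR 2021 paper. As the title suggests, it uses structural re-parameterization to let a VGG-style architecture reach state-of-the-art accuracy again while running faster. This article first walks through the paper and then reimplements the RepVGG model in PyTorch.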

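1. Problems with multi-branch models

Speed:

The paper's measurements show that the theoretical computational density of a 3×3 conv is roughly 4× that of the other kernel sizes, which suggests that total theoretical FLOPs is not a comparable proxy for actual speed across architectures. For example, VGG-16 has 8.4× the FLOPs of EfficientNet-B3, yet it runs 1.8× faster on a 1080Ti.

Multi-branch topologies are widely adopted in Inception and in auto-generated architectures, where many small operators are used instead of a few large ones.

The fragmentation number in NASNet-A is 13, which is unfriendly to devices with strong parallel computing power such as GPUs.

Memory:

Multi-branch topologies are memory-inefficient, because the result of every branch must be kept until the addition or concatenation, which significantly raises peak memory usage. As the paper's figure shows, the input to a residual block has to be kept alive until the addition. Assuming the block preserves the feature-map size, the peak memory occupation is 2× the size of the input.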
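As a rough back-of-the-envelope illustration of the 2× claim, the sketch below counts activation bytes for one shape-preserving block. The tensor shape, the float32 assumption and the helper names are illustrative choices, not values from the paper, and the model ignores weights and framework overhead.

```python
def feature_map_bytes(n: int, c: int, h: int, w: int, bytes_per_elem: int = 4) -> int:
    """Size in bytes of one N x C x H x W activation tensor (float32 by default)."""
    return n * c * h * w * bytes_per_elem


def peak_activation_bytes(input_bytes: int, residual: bool) -> int:
    """Simplified peak activation memory of one shape-preserving block.

    Plain block: the input can be released as soon as the output is computed,
    so roughly one feature map is counted.
    Residual block: the input must stay alive next to the branch output
    until the addition, so two feature maps are counted.
    """
    return input_bytes * (2 if residual else 1)


# hypothetical feature map: batch 32, 64 channels, 56 x 56 pixels
size = feature_map_bytes(n=32, c=64, h=56, w=56)
plain_peak = peak_activation_bytes(size, residual=False)
residual_peak = peak_activation_bytes(size, residual=True)

print(f"one feature map    : {size / 2**20:.1f} MiB")
print(f"plain block peak   : {plain_peak / 2**20:.1f} MiB")
print(f"residual block peak: {residual_peak / 2**20:.1f} MiB")
```

The absolute numbers depend entirely on the assumed shape; only the 2:1 ratio reflects the argument in the text.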

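2. RepVGG

(a) ResNet: it has a multi-path topology during both training and inference, which makes it slow and memory-inefficient.

(b) RepVGG training: the multi-path topology exists only at training time.

On multi-branch models: one explanation for the success of ResNets is that such a multi-branch architecture makes the model an implicit ensemble of many shallower models. Specifically, with n blocks the model can be interpreted as an ensemble of 2^n models, since every block branches the flow into two paths. Multi-branch topologies have drawbacks at inference time, but the branches are beneficial for training, so RepVGG uses multiple branches to build an ensemble of many models at training time only.

RepVGG uses a ResNet-like identity branch (only when the dimensions match, in which case the output is simply the input) and a 1×1 conv branch, so the training-time information flow of a building block is y = x + g(x) + f(x), as in (b) of the paper's figure. With n such blocks, the model therefore becomes an ensemble of 3^n sub-models.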
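The 2^n and 3^n counts can be checked by brute force: a "sub-model" is one choice of branch in every block. The sketch below enumerates those choices with the standard library; `count_paths` is a hypothetical helper name used only for this illustration.

```python
from itertools import product


def count_paths(n_blocks: int, branches_per_block: int) -> int:
    """Count every way of picking exactly one branch in each of n_blocks blocks."""
    return sum(1 for _ in product(range(branches_per_block), repeat=n_blocks))


for n in (1, 2, 5, 10):
    resnet_paths = count_paths(n, 2)   # residual branch or shortcut
    repvgg_paths = count_paths(n, 3)   # 3x3 conv, 1x1 conv or identity
    assert resnet_paths == 2 ** n
    assert repvgg_paths == 3 ** n
    print(f"n={n:2d}  ResNet-style: {resnet_paths:5d}  RepVGG-style: {repvgg_paths:6d}")
```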

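Re-parameterization for the plain inference-time model:

BN is used in every branch before the addition.

Let W(3) of size C2×C1×3×3 denote the kernel of a 3×3 conv layer with C1 input channels and C2 output channels, and let W(1) of size C2×C1 denote the kernel of the 1×1 branch.

μ(3), σ(3), γ(3), β(3) are the accumulated mean, standard deviation, learned scaling factor and bias of the BN layer that follows the 3×3 conv.

The BN parameters after the 1×1 conv are μ(1), σ(1), γ(1), β(1), and those of the identity branch are μ(0), σ(0), γ(0), β(0).

Let M(1) of size N×C1×H1×W1 and M(2) of size N×C2×H2×W2 be the input and the output respectively, and let * be the convolution operator.

If C1 = C2, H1 = H2 and W1 = W2, we have: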
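The equation image that belongs here is missing from this copy. Using only the symbols defined above, the three-branch sum can be reconstructed as follows (a reconstruction consistent with the paper's Eq. (1), not a verbatim copy):

```latex
\[
\begin{aligned}
\mathrm{M}^{(2)} ={}& \operatorname{bn}\!\left(\mathrm{M}^{(1)} * \mathrm{W}^{(3)},\ \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}\right) \\
 &+ \operatorname{bn}\!\left(\mathrm{M}^{(1)} * \mathrm{W}^{(1)},\ \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}\right) \\
 &+ \operatorname{bn}\!\left(\mathrm{M}^{(1)},\ \mu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, \beta^{(0)}\right)
\end{aligned}
\]
```

When the shapes do not match there is no identity branch, and only the first two terms remain.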

where bn is the inference-time BN function:
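The formula image is missing here as well. Inference-time batch normalization is a per-channel affine map, so with the accumulated statistics and learned parameters defined above it can be written as follows (reconstructed, for every output channel i from 1 to C2):

```latex
\[
\operatorname{bn}(\mathrm{M}, \mu, \sigma, \gamma, \beta)_{:,i,:,:}
  = \left(\mathrm{M}_{:,i,:,:} - \mu_i\right)\frac{\gamma_i}{\sigma_i} + \beta_i ,
  \qquad 1 \le i \le C_2 .
\]
```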

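Fusing BN into Conv: first, every BN and its preceding conv layer are converted into a single conv with a bias vector. Let {W', b'} be the converted kernel and bias: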
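The missing formula can be recovered from the BN parameters μ, σ, γ, β of the branch in question: each output channel of the kernel is rescaled, and the BN shift becomes the bias. It also matches the lines marked "eq (3)" in the implementation later in this article. A reconstruction:

```latex
\[
\mathrm{W}'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i}\,\mathrm{W}_{i,:,:,:},
\qquad
b'_i = -\frac{\mu_i \gamma_i}{\sigma_i} + \beta_i .
\]
```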

The inference-time bn of a conv output then becomes:
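The formula image is missing; with {W', b'} as the converted kernel and bias, the identity it expressed is most likely the following (reconstructed, for every output channel i):

```latex
\[
\operatorname{bn}(\mathrm{M} * \mathrm{W}, \mu, \sigma, \gamma, \beta)_{:,i,:,:}
  = \left(\mathrm{M} * \mathrm{W}'\right)_{:,i,:,:} + b'_i .
\]
```

It follows by expanding the per-channel affine map: scaling the output of a convolution by \gamma_i / \sigma_i is the same as scaling the i-th output-channel kernel by that factor, and the remaining constant terms collect into b'_i.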

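Merging all branches: the same transformation also applies to the identity branch, because an identity can be viewed as a 1×1 conv with an identity matrix as its kernel. After these transformations we have one 3×3 kernel, two 1×1 kernels and three bias vectors. The final bias is the sum of the three bias vectors. The final 3×3 kernel is obtained by adding the 1×1 kernels onto the central point of the 3×3 kernel, which can be implemented by zero-padding the two 1×1 kernels to 3×3 and adding the three kernels together, as the paper's figure shows.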
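The kernel merge can be checked on a toy example without any framework. The sketch below is a single-channel, integer-valued illustration under simplifying assumptions: BN and biases are left out (biases would simply add up), and `conv2d_same` and `pad_1x1_to_3x3` are hypothetical helper names written for this example.

```python
from typing import List

Matrix = List[List[int]]


def conv2d_same(x: Matrix, k: Matrix) -> Matrix:
    """Single-channel 2D cross-correlation with zero padding ('same' output size).

    k must be square with an odd side length (1x1 or 3x3 here).
    """
    h, w = len(x), len(x[0])
    r = len(k) // 2
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:  # outside the image counts as zero
                        acc += x[ii][jj] * k[di + r][dj + r]
            out[i][j] = acc
    return out


def pad_1x1_to_3x3(k: Matrix) -> Matrix:
    """Place the single 1x1 weight at the centre of an otherwise zero 3x3 kernel."""
    return [[0, 0, 0], [0, k[0][0], 0], [0, 0, 0]]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
k3 = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]  # 3x3 branch
k1 = [[3]]                                 # 1x1 branch
identity = [[1]]                           # identity viewed as a 1x1 conv

# training-time view: three branches, summed after the fact
branches = add(add(conv2d_same(x, k3), conv2d_same(x, k1)), conv2d_same(x, identity))

# inference-time view: one 3x3 conv with the merged kernel
merged_kernel = add(add(k3, pad_1x1_to_3x3(k1)), pad_1x1_to_3x3(identity))
fused = conv2d_same(x, merged_kernel)

assert branches == fused
print(merged_kernel)
```

Integers are used so that the comparison is exact; with floating-point weights the two results agree only up to rounding.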

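The RepVGG architecture is as follows

The 3×3 layers are arranged into 5 stages, and the first layer of each stage uses stride = 2. For image classification, global average pooling followed by a fully connected layer serves as the classification head. For other tasks, task-specific heads can be applied to the features produced by any layer (for example, segmentation and detection need multi-scale features).

With 1, 2, 4, 14 and 1 layers in the five stages, the resulting network is named RepVGG-A.

The deeper RepVGG-B has 2 more layers in each of stages 2, 3 and 4.

Different variants are produced with different multipliers a and b: a scales the first four stages and b scales the last stage, with b > a. To further reduce parameters and computation, groupwise 3×3 conv layers are optionally interleaved with dense ones, trading some accuracy for efficiency. Specifically, a number of groups g is set for layers 3, 5, 7, ..., 21 of RepVGG-A and additionally for layers 23, 25 and 27 of RepVGG-B. For simplicity, g is set globally to 1, 2 or 4 for these layers, without layer-wise tuning.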
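The scaling rule and the interleaved grouped layers can be sketched in a few lines. Several things here are assumptions recalled from the paper rather than stated in this article: the base widths 64, 64, 128, 256, 512, the cap of the first stage at 64 channels, and the sample multipliers a = 0.75, b = 2.5. `repvgg_widths` is a hypothetical helper name.

```python
from typing import Dict, List


def repvgg_widths(a: float, b: float) -> List[int]:
    """Channel widths of the five stages for width multipliers a and b.

    Assumed base widths: 64, 64, 128, 256, 512. a scales the first four
    stages (the first one capped at 64 channels), b scales the last stage.
    """
    first = min(64, int(64 * a))
    return [first, int(64 * a), int(128 * a), int(256 * a), int(512 * b)]


# layers per stage for the two families described in the text
DEPTHS: Dict[str, List[int]] = {
    "RepVGG-A": [1, 2, 4, 14, 1],
    "RepVGG-B": [1, 4, 6, 16, 1],  # 2 more layers in stages 2, 3 and 4
}

# layers that receive the group count g (1-based layer indices, as in the text)
grouped_layers_a = list(range(3, 22, 2))            # 3, 5, ..., 21
grouped_layers_b = grouped_layers_a + [23, 25, 27]  # RepVGG-B adds three more

print(repvgg_widths(a=0.75, b=2.5))
print(sum(DEPTHS["RepVGG-A"]), sum(DEPTHS["RepVGG-B"]))
```

The total depths (22 and 28 layers) are consistent with the highest grouped layer indices (21 and 27) mentioned above.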

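3. Experimental results

RepVGG-A0 is 1.25% more accurate and 33% faster than ResNet-18, RepVGG-A1 is 0.29%/64% better than ResNet-34, and RepVGG-A2 is 0.17%/83% better than ResNet-50.

With the interleaved groupwise layers (g2/g4), RepVGG models are accelerated further at a reasonable cost in accuracy: RepVGG-B1g4 is 0.37%/101% better than ResNet-101, and RepVGG-B1g2 is 2.66× as fast as ResNet-152 at the same accuracy.

Although the number of parameters is not the primary concern, all of the RepVGG models above use their parameters more efficiently than the ResNets.

Compared with the classic VGG-16, RepVGG-B2 has only 58% of the parameters, runs 10% faster and is 6.57% more accurate.

RepVGG models reach above 80% accuracy with 200 epochs of training. RepVGG-A2 outperforms EfficientNet-B0 by 1.37%/59%, and RepVGG-B1 is 0.39% more accurate than RegNetX-3.2GF while also running slightly faster.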

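4. Ablation study

With both extra branches (shown in the paper's figure) removed, the training-time model degrades into an ordinary plain model, and accuracy is only 72.39%.

Keeping only the 1×1 conv branch gives 73.15%, and keeping only the identity branch gives 74.79%; both are below the full block.

The full-featured RepVGG-B0 model reaches 75.14%, which is 2.75% higher than the ordinary plain model.

Segmentation:

The paper's results use a modified PSPNet framework. With RepVGG backbones, the modified PSPNet runs considerably faster than with ResNet-50/101 backbones, and the RepVGG backbones outperform both ResNet-50 and ResNet-101.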

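Now let's implement it in PyTorch

1. Single-branch and multi-branch models

To implement RepVGG we first need to understand multi-branch models: the input passes through different layers, and the results are then aggregated in some way (usually by addition).

The paper also points out that a multi-branch model is an implicit ensemble of many shallower models. More specifically, the model can be interpreted as an ensemble of 2^n models, since every block splits the flow into two paths.

Multi-branch models are slower than single-branch ones and consume more memory. Let's build a classic block to see why.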

import torch
from torch import nn, Tensor
from torchvision.ops import Conv2dNormActivation
from typing import Dict, List

torch.manual_seed(0)

class ResNetBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.weight = nn.Sequential(
            Conv2dNormActivation(in_channels, out_channels, kernel_size=3, stride=stride),
            Conv2dNormActivation(out_channels, out_channels, kernel_size=3, activation_layer=None),
        )
        self.shortcut = (
            Conv2dNormActivation(
                in_channels,
                out_channels,
                kernel_size=1,
                stride=stride,
                activation_layer=None,
            )
            # a projection is also needed when the stride changes the spatial size
            if in_channels != out_channels or stride != 1
            else nn.Identity())

        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        res = self.shortcut(x)  # <- 2x memory
        x = self.weight(x)
        x += res
        x = self.act(x)  # <- 1x memory
        return x

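Keeping the residual alive costs 2× the memory, as the paper's figure illustrates.

The multi-branch structure is only useful during training. If we can remove it at inference time, we improve both the model's speed and its memory consumption. Let's see how to do that in code:

2. From multi-branch to single branch

Consider the following case: two branches, each made of a single 3×3 conv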

class TwoBranches(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3)
        self.conv2 = nn.Conv2d(in_channels, out_channels, kernel_size=3)
        
    def forward(self, x):
        x1 = self.conv1(x)
        x2 = self.conv2(x)
        return x1 + x2

Let's look at the result

two_branches = TwoBranches(8, 8)

x = torch.randn((1, 8, 7, 7))

two_branches(x).shape

torch.Size([1, 8, 5, 5])

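Now create a single conv, which we call "conv_fused", such that conv_fused(x) = conv1(x) + conv2(x). We can simply sum the weights and the biases of the two convolutions; this is valid because convolution is a linear operation.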

conv1 = two_branches.conv1
conv2 = two_branches.conv2

conv_fused = nn.Conv2d(conv1.in_channels, conv1.out_channels, kernel_size=conv1.kernel_size)

conv_fused.weight = nn.Parameter(conv1.weight + conv2.weight)
conv_fused.bias =  nn.Parameter(conv1.bias + conv2.bias)

# check they give the same output
assert torch.allclose(two_branches(x), conv_fused(x), atol=1e-5)
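Why summing weights and biases is valid can be written out in one line. With W_1, b_1 and W_2, b_2 as the weights and biases of the two convolutions, and assuming both share the same kernel size, stride, padding, dilation and groups:

```latex
\[
\operatorname{conv}_1(x) + \operatorname{conv}_2(x)
  = (x * W_1 + b_1) + (x * W_2 + b_2)
  = x * (W_1 + W_2) + (b_1 + b_2) .
\]
```

If the two branches differ in any of those hyperparameters, the kernels first have to be brought to a common shape (for example by zero-padding) before they can be added.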

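Let's compare the speed!

from time import perf_counter

two_branches.to("cuda")
conv_fused.to("cuda")

with torch.no_grad():
    x = torch.randn((4, 8, 7, 7), device=torch.device("cuda"))

    # warm-up so that CUDA initialization is not included in the timings
    two_branches(x)
    conv_fused(x)
    torch.cuda.synchronize()

    start = perf_counter()
    two_branches(x)
    # CUDA kernels run asynchronously: wait for them before stopping the clock
    torch.cuda.synchronize()
    print(f"conv1(x) + conv2(x) took {perf_counter() - start:.6f}s")

    start = perf_counter()
    conv_fused(x)
    torch.cuda.synchronize()
    print(f"conv_fused(x) took {perf_counter() - start:.6f}s")

In the author's original run, the fused conv was about twice as fast:

conv1(x) + conv2(x) took 0.000421s
conv_fused(x) took 0.000215s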

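3. Fusing Conv and BatchNorm

BatchNorm is used as the layer right after a convolution. The paper fuses the two together, i.e. conv_fused(x) = batchnorm(conv(x)).

The paper's two formulas for this were shown together as a screenshot in the original post, for easier reading:

The code looks like this: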

def get_fused_bn_to_conv_state_dict(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> Dict[str, Tensor]:
    # in the paper, weights is gamma and bias is beta
    bn_mean, bn_var, bn_gamma, bn_beta = (
        bn.running_mean,
        bn.running_var,
        bn.weight,
        bn.bias,
    )
    # we need the std!
    bn_std = (bn_var + bn.eps).sqrt()
    # eq (3)
    conv_weight = nn.Parameter((bn_gamma / bn_std).reshape(-1, 1, 1, 1) * conv.weight)
    # still eq (3)
    conv_bias = nn.Parameter(bn_beta - bn_mean * bn_gamma / bn_std)
    return {"weight": conv_weight, "bias": conv_bias}

Let's see how it works:

conv_bn = nn.Sequential(nn.Conv2d(8, 8, kernel_size=3, bias=False),
    nn.BatchNorm2d(8)
)

torch.nn.init.uniform_(conv_bn[1].weight)
torch.nn.init.uniform_(conv_bn[1].bias)

with torch.no_grad():
    # be sure to switch to eval mode!!
    conv_bn = conv_bn.eval()
    conv_fused = nn.Conv2d(conv_bn[0].in_channels, 
                           conv_bn[0].out_channels, 
                           kernel_size=conv_bn[0].kernel_size)

    conv_fused.load_state_dict(get_fused_bn_to_conv_state_dict(conv_bn[0], conv_bn[1]))

    x = torch.randn((1, 8, 7, 7))
    
    assert torch.allclose(conv_bn(x), conv_fused(x), atol=1e-5)

This is how the paper fuses the Conv2d and BatchNorm2d layers.
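The same folding can be verified by hand on a single scalar "channel", which makes the arithmetic easy to follow. The sketch below uses only the standard library; the numeric values and the names `bn_eval` and `fold_bn` are made up for illustration, and the 1x1 "conv" is reduced to a plain multiplication.

```python
import math


def bn_eval(z: float, mean: float, var: float, gamma: float, beta: float,
            eps: float = 1e-5) -> float:
    """Inference-time batch norm of one value, using accumulated statistics."""
    return (z - mean) / math.sqrt(var + eps) * gamma + beta


def fold_bn(w: float, mean: float, var: float, gamma: float, beta: float,
            eps: float = 1e-5):
    """Fold BN into the preceding (bias-free) weight: returns (w', b')."""
    std = math.sqrt(var + eps)
    return w * gamma / std, beta - mean * gamma / std


w = 0.8                                     # weight of a bias-free 1x1 "conv"
mean, var, gamma, beta = 0.3, 1.7, 1.2, -0.4
w_fused, b_fused = fold_bn(w, mean, var, gamma, beta)

for x in (-2.0, 0.0, 0.5, 3.0):
    reference = bn_eval(w * x, mean, var, gamma, beta)  # conv followed by BN
    fused = w_fused * x + b_fused                       # single fused conv
    assert math.isclose(reference, fused, rel_tol=1e-9, abs_tol=1e-12)

print(w_fused, b_fused)
```

The per-channel case in the function above is the same computation, repeated for every output channel and broadcast over the kernel.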

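The paper's goal is now clear: fuse the entire model into a single data stream (no branches) to make the network faster!

The authors propose a new RepVGG block. Like a ResNet block it has a residual shortcut, and it adds an identity branch that can be fused away to make inference faster.

Following the paper's figure, the PyTorch code is: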

class RepVGGBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.block = Conv2dNormActivation(
            in_channels,
            out_channels,
            kernel_size=3,
            padding=1,
            bias=False,
            stride=stride,
            activation_layer=None,
            # the original model may also have groups > 1
        )

        self.shortcut = Conv2dNormActivation(
            in_channels,
            out_channels,
            kernel_size=1,
            stride=stride,
            activation_layer=None,
        )

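        # the identity branch is a lone BatchNorm; it only exists when input and
        # output shapes match (same channels and stride 1)
        self.identity = (
            nn.BatchNorm2d(out_channels) if in_channels == out_channels and stride == 1 else None
        )

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        res = x  # <- 2x memory
        x = self.block(x)
        x += self.shortcut(res)
        if self.identity is not None:
            x += self.identity(res)
        x = self.relu(x)  # <- 1x memory
        return x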

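4. Re-parameterization

We have a 3×3 conv-bn, a 1×1 conv-bn and (sometimes) a batchnorm (the identity branch). We want to fuse them into a single conv_fused such that conv_fused(x) = 3x3conv-bn(x) + 1x1conv-bn(x) + bn(x), or, when there is no identity branch, conv_fused(x) = 3x3conv-bn(x) + 1x1conv-bn(x).

To create this conv_fused, we need to do the following:

Fuse 3x3conv-bn(x) into a single 3x3conv
Fuse 1x1conv-bn(x), then convert it to a 3x3conv
Convert the identity BN into a 3x3conv
Add the three 3x3convs together

The paper's figure summarizes these steps: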

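The first step is easy: we can use get_fused_bn_to_conv_state_dict on RepVGGBlock.block (the main 3×3 conv-bn).

The second step is similar: use get_fused_bn_to_conv_state_dict on RepVGGBlock.shortcut (the 1×1 conv-bn), then zero-pad the fused 1×1 kernel by one element on each side of both spatial dimensions to form a 3×3 kernel, as described in the paper.

The identity bn is trickier. The paper's trick is to create a 3×3 conv that emulates the identity, i.e. acts as an identity function, and then fuse it with the identity bn using get_fused_bn_to_conv_state_dict. This is again done by setting the weight at the centre of the kernel to 1 for the matching channel.

Conv weights have shape (out_channels, in_channels, kernel_h, kernel_w). To create an identity, conv(x) = x, we only need to set weight[i, i, 1, 1] to 1 for every channel i and leave everything else at zero. The code is: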

with torch.no_grad():
    x = torch.randn((1,2,3,3))
    identity_conv = nn.Conv2d(2,2,kernel_size=3, padding=1, bias=False)
    identity_conv.weight.zero_()
    print(identity_conv.weight.shape)

    in_channels = identity_conv.in_channels
    for i in range(in_channels):
        identity_conv.weight[i, i % in_channels, 1, 1] = 1

    print(identity_conv.weight)
    
    out = identity_conv(x)
    assert torch.allclose(x, out)

Result

torch.Size([2, 2, 3, 3])
Parameter containing:
tensor([[[[0., 0., 0.],
          [0., 1., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]]],


        [[[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 1., 0.],
          [0., 0., 0.]]]], requires_grad=True)

We have created a conv that behaves like an identity function. Putting everything together gives the re-parameterization described in the paper.

@torch.no_grad()  # the in-place edits of parameters below are only legal with autograd off
def get_fused_conv_state_dict_from_block(block: RepVGGBlock) -> Dict[str, Tensor]:
    # step 1: fuse the main 3x3 conv with its BN
    fused_block_conv_state_dict = get_fused_bn_to_conv_state_dict(
        block.block[0], block.block[1]
    )

    if block.shortcut is not None:
        # step 2: fuse the 1x1 shortcut with its BN
        conv_1x1_state_dict = get_fused_bn_to_conv_state_dict(
            block.shortcut[0], block.shortcut[1]
        )
        # zero-pad the 1x1 kernel to 3x3 (one zero on each side of H and W)
        conv_1x1_state_dict["weight"] = torch.nn.functional.pad(
            conv_1x1_state_dict["weight"], [1, 1, 1, 1]
        )
        fused_block_conv_state_dict["weight"] += conv_1x1_state_dict["weight"]
        fused_block_conv_state_dict["bias"] += conv_1x1_state_dict["bias"]
    if block.identity is not None:
        # step 3: create our identity 3x3 conv kernel
        identity_conv = nn.Conv2d(
            block.block[0].in_channels,
            block.block[0].in_channels,
            kernel_size=3,
            bias=False,
            padding=1,
        ).to(block.block[0].weight.device)
        # set all weights to zero!
        identity_conv.weight.zero_()
        # set the middle element to one for the matching channel
        for i in range(identity_conv.in_channels):
            identity_conv.weight[i, i, 1, 1] = 1
        # fuse the 3x3 identity with the identity BN
        identity_state_dict = get_fused_bn_to_conv_state_dict(identity_conv, block.identity)
        fused_block_conv_state_dict["weight"] += identity_state_dict["weight"]
        fused_block_conv_state_dict["bias"] += identity_state_dict["bias"]

    # step 4: the sums above already hold the merged kernel and bias
    fused_conv_state_dict = {k: nn.Parameter(v) for k, v in fused_block_conv_state_dict.items()}

    return fused_conv_state_dict

Finally, define a RepVGGFastBlock. It consists of just a conv + relu

class RepVGGFastBlock(nn.Sequential):
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.relu = nn.ReLU(inplace=True)

and add a to_fast method to RepVGGBlock that quickly creates the corresponding RepVGGFastBlock

class RepVGGBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.block = Conv2dNormActivation(
            in_channels,
            out_channels,
            kernel_size=3,
            padding=1,
            bias=False,
            stride=stride,
            activation_layer=None,
            # the original model may also have groups > 1
        )

        self.shortcut = Conv2dNormActivation(
            in_channels,
            out_channels,
            kernel_size=1,
            stride=stride,
            activation_layer=None,
        )

        # the identity branch only exists when input and output shapes match
        self.identity = (
            nn.BatchNorm2d(out_channels) if in_channels == out_channels and stride == 1 else None
        )

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        res = x  # <- 2x memory
        x = self.block(x)
        x += self.shortcut(res)
        if self.identity is not None:
            x += self.identity(res)
        x = self.relu(x)  # <- 1x memory
        return x

    @torch.no_grad()  # fusing edits weights in place, so autograd must be off
    def to_fast(self) -> RepVGGFastBlock:
        fused_conv_state_dict = get_fused_conv_state_dict_from_block(self)
        fast_block = RepVGGFastBlock(
            self.block[0].in_channels,
            self.block[0].out_channels,
            stride=self.block[0].stride,
        )

        fast_block.conv.load_state_dict(fused_conv_state_dict)

        return fast_block

5. RepVGG

Define RepVGGStage (a collection of blocks) and RepVGG, the latter with a switch_to_fast method:

class RepVGGStage(nn.Sequential):
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        depth: int,
    ):
        super().__init__(RepVGGBlock(in_channels, out_channels, stride=2),
            *[RepVGGBlock(out_channels, out_channels) for _ in range(depth - 1)],
        )

class RepVGG(nn.Sequential):
    def __init__(self, widths: List[int], depths: List[int], in_channels: int = 3):
        super().__init__()
        in_out_channels = zip(widths, widths[1:])

        self.stages = nn.Sequential(RepVGGStage(in_channels, widths[0], depth=1),
            *[RepVGGStage(in_channels, out_channels, depth)
                for (in_channels, out_channels), depth in zip(in_out_channels, depths)
            ],
        )

        # omit classification head for simplicity

    def switch_to_fast(self):
        for stage in self.stages:
            for i, block in enumerate(stage):
                stage[i] = block.to_fast()
        return self

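That's it. Now let's test it.

6. Testing the model

A benchmark was set up in benchmark.py that runs the model on a GTX 1080Ti with different batch sizes. Here are the results:

The model has two layers per stage, four stages, and widths of 64, 128, 256, 512.

In the paper, these values are scaled by certain ratios (called a and b), and grouped convolutions are used. Since we are more interested in the re-parameterization part, that is skipped here; it is a tuning exercise, and the values can be found with a hyperparameter search.

In short, the re-parameterized model runs on a different time scale from the vanilla one, and the improvement is clear.

For batch_size=128, the default (multi-branch) model takes 1.45 s, while the re-parameterized (fast) model takes only 0.0134 s, which is a 108× speedup.

In this article we first walked through the RepVGG paper in detail, then learned step by step how to build RepVGG, with a focus on the weight re-parameterization, and reproduced the paper's model in PyTorch. RepVGG's re-parameterization essentially burns the bridge after crossing it: it gets the benefit of multi-branch training for free and still ends up faster at inference, which is almost annoyingly clever. The same "free lunch" technique can also be ported to other architectures.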

The code is here:

https://avoid.overfit.cn/post/f9263685607b40df80e5c4f949a28b42
Thanks for reading!
