The ConvNext paper proposes a new convolution-based architecture that not only outperforms Transformer-based models such as Swin, but also scales with the amount of data available! Today we will reproduce it in PyTorch. The figure below (from the paper) shows ConvNext accuracy for different dataset/model sizes.

The authors start from the well-known ResNet architecture and iteratively improve it using new best practices and discoveries made over the last decade. They focus on Swin-Transformer and closely follow its design choices. We have recommended that paper before; if you haven't read it yet, we strongly recommend it :)

The figure below shows all the various improvements and the performance after each one of them.

The paper splits the design roadmap into two parts: macro design and micro design. Macro design covers the changes made from a high-level perspective, such as the overall architecture, while micro design is more about the details, such as activation functions and normalization.

Below we start from a classic BottleNeck block and implement each change described in the paper, one by one, in PyTorch.

Starting from ResNet

ResNet is built from stacked residual bottleneck blocks, so that's where we begin.
```python
from torch import nn
from torch import Tensor
from typing import List


class ConvNormAct(nn.Sequential):
    """
    A little util layer composed by (conv) -> (norm) -> (act) layers.
    """

    def __init__(
        self,
        in_features: int,
        out_features: int,
        kernel_size: int,
        norm=nn.BatchNorm2d,
        act=nn.ReLU,
        **kwargs
    ):
        super().__init__(
            nn.Conv2d(
                in_features,
                out_features,
                kernel_size=kernel_size,
                padding=kernel_size // 2,
                **kwargs
            ),
            norm(out_features),
            act(),
        )


class BottleNeckBlock(nn.Module):
    def __init__(
        self,
        in_features: int,
        out_features: int,
        reduction: int = 4,
        stride: int = 1,
    ):
        super().__init__()
        reduced_features = out_features // reduction
        self.block = nn.Sequential(
            # wide -> narrow
            ConvNormAct(
                in_features, reduced_features, kernel_size=1, stride=stride, bias=False
            ),
            # narrow -> narrow
            ConvNormAct(reduced_features, reduced_features, kernel_size=3, bias=False),
            # narrow -> wide
            ConvNormAct(reduced_features, out_features, kernel_size=1, bias=False, act=nn.Identity),
        )
        self.shortcut = (
            nn.Sequential(
                ConvNormAct(
                    in_features, out_features, kernel_size=1, stride=stride, bias=False
                )
            )
            if in_features != out_features
            else nn.Identity()
        )
        self.act = nn.ReLU()

    def forward(self, x: Tensor) -> Tensor:
        res = x
        x = self.block(x)
        res = self.shortcut(res)
        x += res
        x = self.act(x)
        return x
```
Let's check that the code above works:
```python
import torch

x = torch.rand(1, 32, 7, 7)
block = BottleNeckBlock(32, 64)
block(x).shape  # torch.Size([1, 64, 7, 7])
```
Next we define a Stage. A stage is a collection of residual blocks, and each stage usually downsamples the input by a factor of 2.
```python
class ConvNexStage(nn.Sequential):
    def __init__(
        self, in_features: int, out_features: int, depth: int, stride: int = 2, **kwargs
    ):
        super().__init__(
            # downsample is done here
            BottleNeckBlock(in_features, out_features, stride=stride, **kwargs),
            *[
                BottleNeckBlock(out_features, out_features, **kwargs)
                for _ in range(depth - 1)
            ],
        )
```
Test:
```python
stage = ConvNexStage(32, 64, depth=2)
stage(x).shape  # torch.Size([1, 64, 4, 4])
```
We have reduced the input from 7x7 to 4x4.

ResNet also has a so-called stem, the first layer in the model that aggressively downsamples the input image.
```python
class ConvNextStem(nn.Sequential):
    def __init__(self, in_features: int, out_features: int):
        super().__init__(
            ConvNormAct(
                in_features, out_features, kernel_size=7, stride=2
            ),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
```
Now we can define ConvNextEncoder, which stitches the stages together and maps an input image to the final embedding.
```python
class ConvNextEncoder(nn.Module):
    def __init__(
        self,
        in_channels: int,
        stem_features: int,
        depths: List[int],
        widths: List[int],
    ):
        super().__init__()
        self.stem = ConvNextStem(in_channels, stem_features)

        in_out_widths = list(zip(widths, widths[1:]))

        self.stages = nn.ModuleList(
            [
                ConvNexStage(stem_features, widths[0], depths[0], stride=1),
                *[
                    ConvNexStage(in_features, out_features, depth)
                    for (in_features, out_features), depth in zip(
                        in_out_widths, depths[1:]
                    )
                ],
            ]
        )

    def forward(self, x):
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
        return x
```
The test result is as follows:
```python
image = torch.rand(1, 3, 224, 224)
encoder = ConvNextEncoder(in_channels=3, stem_features=64, depths=[3, 4, 6, 4], widths=[256, 512, 1024, 2048])
encoder(image).shape  # torch.Size([1, 2048, 7, 7])
```
We have now implemented a resnet50 encoder; attach a classification head and it will work for image classification (a minimal sketch of such a head follows). After that we get to the main topic of this article: implementing ConvNext.
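As a quick aside (our own sketch, not part of the paper's roadmap), such a head is just global pooling followed by a linear layer; the 1000 classes here are an arbitrary assumption:

```python
# a minimal sketch of a classification head on top of the resnet50 encoder;
# num_classes=1000 is an arbitrary assumption (e.g. ImageNet)
classifier = nn.Sequential(
    encoder,
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(1),
    nn.Linear(2048, 1000),  # 2048 = widths[-1] of the encoder above
)
classifier(image).shape  # torch.Size([1, 1000])
```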
Macro Design
1. Changing the stage compute ratio

The traditional ResNet contains 4 stages, and Swin Transformer uses a 1:1:3:1 ratio across its 4 stages (one share of blocks in the first stage, one in the second, three in the third, one in the fourth). Adjusting ResNet50 to this ratio ((3, 4, 6, 3) -> (3, 3, 9, 3)) improves performance from 78.8% to 79.4%.
```python
encoder = ConvNextEncoder(in_channels=3, stem_features=64, depths=[3, 3, 9, 3], widths=[256, 512, 1024, 2048])
```
2. Changing the stem to "Patchify"

The ResNet stem uses a very aggressive 7x7 conv and maxpool to heavily downsample the input image. Transformers, however, use a "patchify" stem, meaning they embed the input image as patches. Vision Transformers use very aggressive patches (16x16), while the ConvNext authors use a 4x4 patch implemented with a conv layer, which raises performance from 79.4% to 79.5%.
```python
class ConvNextStem(nn.Sequential):
    def __init__(self, in_features: int, out_features: int):
        super().__init__(
            nn.Conv2d(in_features, out_features, kernel_size=4, stride=4),
            nn.BatchNorm2d(out_features)
        )
```
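A quick shape check of the new stem (our own test, in the style of the earlier ones): a 4x4 conv with stride 4 should shrink a 224x224 input to 56x56.

```python
# our own shape check: kernel_size=4, stride=4 gives 224 / 4 = 56
stem = ConvNextStem(3, 64)
stem(torch.rand(1, 3, 224, 224)).shape  # torch.Size([1, 64, 56, 56])
```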
3. ResNeXt-ify

ResNeXt uses grouped convolution for the 3x3 conv layer in the BottleNeck to reduce FLOPs. ConvNext uses depth-wise convolution (as in MobileNet and, later, EfficientNet). Depth-wise convolution is a form of grouped convolution in which the number of groups equals the number of input channels.

The authors note that this is very similar to the weighted-sum operation in self-attention, which mixes information only in the spatial dimension. Using depth-wise convs reduces accuracy (since width is not increased as in ResNeXt), but that is expected given the speed-up.

So we change the 3x3 conv inside the BottleNeck block to the line below:
```python
ConvNormAct(reduced_features, reduced_features, kernel_size=3, bias=False, groups=reduced_features)
```
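To see where the savings come from, here is a quick parameter-count comparison (our own sketch, not from the paper): a depth-wise 3x3 conv over C channels has roughly 1/C the weights of its dense counterpart.

```python
# our own sanity check: Conv2d weight shape is (out, in / groups, k, k)
dense = nn.Conv2d(64, 64, kernel_size=3, bias=False)
depthwise = nn.Conv2d(64, 64, kernel_size=3, bias=False, groups=64)
sum(p.numel() for p in dense.parameters())      # 64 * 64 * 3 * 3 = 36864
sum(p.numel() for p in depthwise.parameters())  # 64 * 1 * 3 * 3 = 576
```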
4. Inverted Bottleneck

A standard BottleNeck first reduces the features with a 1x1 conv, then applies a 3x3 conv, and finally expands the features back to the original size; an inverted bottleneck block does the opposite.

So below we change from wide -> narrow -> wide to narrow -> wide -> narrow.

This matches Transformers, where the MLP follows a narrow -> wide -> narrow design: its hidden layer is four times wider than the input features.
```python
class BottleNeckBlock(nn.Module):
    def __init__(
        self,
        in_features: int,
        out_features: int,
        expansion: int = 4,
        stride: int = 1,
    ):
        super().__init__()
        expanded_features = out_features * expansion
        self.block = nn.Sequential(
            # narrow -> wide
            ConvNormAct(
                in_features, expanded_features, kernel_size=1, stride=stride, bias=False
            ),
            # wide -> wide (with depth-wise)
            ConvNormAct(expanded_features, expanded_features, kernel_size=3, bias=False, groups=in_features),
            # wide -> narrow
            ConvNormAct(expanded_features, out_features, kernel_size=1, bias=False, act=nn.Identity),
        )
        self.shortcut = (
            nn.Sequential(
                ConvNormAct(
                    in_features, out_features, kernel_size=1, stride=stride, bias=False
                )
            )
            if in_features != out_features
            else nn.Identity()
        )
        self.act = nn.ReLU()

    def forward(self, x: Tensor) -> Tensor:
        res = x
        x = self.block(x)
        res = self.shortcut(res)
        x += res
        x = self.act(x)
        return x
```
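A quick shape check of the inverted block (our own test):

```python
# our own shape check of the inverted bottleneck
block = BottleNeckBlock(32, 64)
block(torch.rand(1, 32, 7, 7)).shape  # torch.Size([1, 64, 7, 7])
```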
5. Large kernel sizes

ConvNext adopts the larger (7x7) kernel size used in Swin's windows. Increasing the kernel size makes computation more expensive, which is why the depth-wise convolution above is used: it cuts the computation by working on fewer channels per group. The authors also move the depth-wise conv to the start of the block, similar to Transformer models, where multi-head self-attention (MSA) is done before the MLP layers.
```python
class BottleNeckBlock(nn.Module):
    def __init__(
        self,
        in_features: int,
        out_features: int,
        expansion: int = 4,
        stride: int = 1,
    ):
        super().__init__()
        expanded_features = out_features * expansion
        self.block = nn.Sequential(
            # depth-wise conv with the bigger 7x7 kernel, moved to the start of the block
            ConvNormAct(
                in_features, in_features, kernel_size=7, stride=stride, bias=False, groups=in_features
            ),
            # narrow -> wide
            ConvNormAct(in_features, expanded_features, kernel_size=1),
            # wide -> narrow
            ConvNormAct(expanded_features, out_features, kernel_size=1, bias=False, act=nn.Identity),
        )
        self.shortcut = (
            nn.Sequential(
                ConvNormAct(
                    in_features, out_features, kernel_size=1, stride=stride, bias=False
                )
            )
            if in_features != out_features
            else nn.Identity()
        )
        self.act = nn.ReLU()

    def forward(self, x: Tensor) -> Tensor:
        res = x
        x = self.block(x)
        res = self.shortcut(res)
        x += res
        x = self.act(x)
        return x
```

This improves accuracy from 79.9% to 80.6%.
Micro Design
1. Replacing ReLU with GELU

Transformers use GELU, so why don't we? The authors tested the replacement and accuracy stayed unchanged. PyTorch's GELU is available as nn.GELU.
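For intuition (our own mini example), GELU is a smooth variant of ReLU that lets small negative values through:

```python
# our own illustration; the GELU outputs are approximate
x = torch.tensor([-1.0, 0.0, 1.0])
nn.ReLU()(x)  # tensor([0., 0., 1.])
nn.GELU()(x)  # roughly tensor([-0.1587, 0.0000, 0.8413])
```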
2. Fewer activation functions

Our residual block has three activation functions, while a Transformer block has only one: the activation inside the MLP block. The authors removed all activations except the one in the middle of the block, as in Swin-T, which raises accuracy to 81.3%!

3. Fewer normalization layers

Like activations, Transformer blocks also have fewer normalization layers. The authors decided to remove all BatchNorms except the one before the middle conv.

4. Replacing BN with LN

The authors replaced the BN layers with LN. They note that doing this in the original ResNet hurts performance, but after all the changes above it raises performance to 81.5%.
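The combined block below uses nn.GroupNorm with num_groups=1 as a channels-first stand-in for LayerNorm. A quick sanity check (our own, not from the article) shows it normalizes over (C, H, W) per sample, matching nn.LayerNorm over those dimensions:

```python
# our own sanity check: GroupNorm with num_groups=1 normalizes over
# (C, H, W) per sample, like LayerNorm over those dimensions
x = torch.randn(2, 64, 7, 7)
gn = nn.GroupNorm(num_groups=1, num_channels=64)         # default affine init: weight=1, bias=0
ln = nn.LayerNorm([64, 7, 7], elementwise_affine=False)
torch.allclose(gn(x), ln(x), atol=1e-5)  # True
```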
Now let's combine the four steps above:
```python
class BottleNeckBlock(nn.Module):
    def __init__(
        self,
        in_features: int,
        out_features: int,
        expansion: int = 4,
        stride: int = 1,
    ):
        super().__init__()
        expanded_features = out_features * expansion
        self.block = nn.Sequential(
            # depth-wise conv with the bigger 7x7 kernel
            # (padding=3 keeps the spatial size so the residual addition works)
            nn.Conv2d(
                in_features, in_features, kernel_size=7, stride=stride, padding=3, bias=False, groups=in_features
            ),
            # GroupNorm with num_groups=1 is the same as LayerNorm but works for 2D data
            nn.GroupNorm(num_groups=1, num_channels=in_features),
            # narrow -> wide
            nn.Conv2d(in_features, expanded_features, kernel_size=1),
            nn.GELU(),
            # wide -> narrow
            nn.Conv2d(expanded_features, out_features, kernel_size=1),
        )
        self.shortcut = (
            nn.Sequential(
                ConvNormAct(
                    in_features, out_features, kernel_size=1, stride=stride, bias=False
                )
            )
            if in_features != out_features
            else nn.Identity()
        )

    def forward(self, x: Tensor) -> Tensor:
        res = x
        x = self.block(x)
        res = self.shortcut(res)
        x += res
        return x
```
Separate downsampling layers

In ResNet, downsampling is done by the stride=2 conv. Transformers (and other conv nets) use a separate downsampling module instead. The authors removed the stride=2 and added a downsampling block before the three convs; to keep training stable, a normalization layer is needed before the downsampling operation. This module is added to ConvNexStage. The result reaches 82.0%, surpassing Swin!
```python
class ConvNexStage(nn.Sequential):
    def __init__(
        self, in_features: int, out_features: int, depth: int, **kwargs
    ):
        super().__init__(
            # add the downsampler
            nn.Sequential(
                nn.GroupNorm(num_groups=1, num_channels=in_features),
                nn.Conv2d(in_features, out_features, kernel_size=2, stride=2)
            ),
            *[
                BottleNeckBlock(out_features, out_features, **kwargs)
                for _ in range(depth)
            ],
        )
```
This gives us the final BottleNeckBlock code:
```python
class BottleNeckBlock(nn.Module):
    def __init__(
        self,
        in_features: int,
        out_features: int,
        expansion: int = 4,
    ):
        super().__init__()
        expanded_features = out_features * expansion
        self.block = nn.Sequential(
            # depth-wise conv with the bigger 7x7 kernel
            nn.Conv2d(
                in_features, in_features, kernel_size=7, padding=3, bias=False, groups=in_features
            ),
            # GroupNorm with num_groups=1 is the same as LayerNorm but works for 2D data
            nn.GroupNorm(num_groups=1, num_channels=in_features),
            # narrow -> wide
            nn.Conv2d(in_features, expanded_features, kernel_size=1),
            nn.GELU(),
            # wide -> narrow
            nn.Conv2d(expanded_features, out_features, kernel_size=1),
        )

    def forward(self, x: Tensor) -> Tensor:
        res = x
        x = self.block(x)
        x += res
        return x
```
Let's test the final stage code:
```python
stage = ConvNexStage(32, 62, depth=1)
stage(torch.randn(1, 32, 14, 14)).shape  # torch.Size([1, 62, 7, 7])
```
Final touches

The paper also adds Stochastic Depth, also known as Drop Path, as well as Layer Scale.
```python
from torchvision.ops import StochasticDepth


class LayerScaler(nn.Module):
    def __init__(self, init_value: float, dimensions: int):
        super().__init__()
        self.gamma = nn.Parameter(
            init_value * torch.ones((dimensions)), requires_grad=True
        )

    def forward(self, x):
        return self.gamma[None, ..., None, None] * x


class BottleNeckBlock(nn.Module):
    def __init__(
        self,
        in_features: int,
        out_features: int,
        expansion: int = 4,
        drop_p: float = 0.0,
        layer_scaler_init_value: float = 1e-6,
    ):
        super().__init__()
        expanded_features = out_features * expansion
        self.block = nn.Sequential(
            # depth-wise conv with the bigger 7x7 kernel
            nn.Conv2d(
                in_features, in_features, kernel_size=7, padding=3, bias=False, groups=in_features
            ),
            # GroupNorm with num_groups=1 is the same as LayerNorm but works for 2D data
            nn.GroupNorm(num_groups=1, num_channels=in_features),
            # narrow -> wide
            nn.Conv2d(in_features, expanded_features, kernel_size=1),
            nn.GELU(),
            # wide -> narrow
            nn.Conv2d(expanded_features, out_features, kernel_size=1),
        )
        self.layer_scaler = LayerScaler(layer_scaler_init_value, out_features)
        self.drop_path = StochasticDepth(drop_p, mode="batch")

    def forward(self, x: Tensor) -> Tensor:
        res = x
        x = self.block(x)
        x = self.layer_scaler(x)
        x = self.drop_path(x)
        x += res
        return x
```
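For intuition (our own note, based on torchvision's documented behavior), StochasticDepth randomly zeroes the residual branch with probability p during training, rescaling survivors by 1/(1-p), and acts as the identity in eval mode:

```python
# our own behavior sketch of torchvision's StochasticDepth
drop = StochasticDepth(p=0.5, mode="batch")
drop.train()
drop(torch.ones(1, 2))  # either all zeros or all 2.0, at random
drop.eval()
drop(torch.ones(1, 2))  # tensor([[1., 1.]])
```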
OK, now let's look at the final result:
```python
stage = ConvNexStage(32, 62, depth=1)
stage(torch.randn(1, 32, 14, 14)).shape  # torch.Size([1, 62, 7, 7])
```
Finally, let's wire up the Drop Path probabilities (one per stage):
```python
class ConvNextEncoder(nn.Module):
    def __init__(
        self,
        in_channels: int,
        stem_features: int,
        depths: List[int],
        widths: List[int],
        drop_p: float = 0.0,
    ):
        super().__init__()
        self.stem = ConvNextStem(in_channels, stem_features)

        in_out_widths = list(zip(widths, widths[1:]))
        # create drop paths probabilities (one for each stage)
        drop_probs = [x.item() for x in torch.linspace(0, drop_p, sum(depths))]

        self.stages = nn.ModuleList(
            [
                ConvNexStage(stem_features, widths[0], depths[0], drop_p=drop_probs[0]),
                *[
                    ConvNexStage(in_features, out_features, depth, drop_p=drop_p)
                    for (in_features, out_features), depth, drop_p in zip(
                        in_out_widths, depths[1:], drop_probs[1:]
                    )
                ],
            ]
        )

    def forward(self, x):
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
        return x
```
Test:
```python
image = torch.rand(1, 3, 224, 224)
encoder = ConvNextEncoder(in_channels=3, stem_features=64, depths=[3, 4, 6, 4], widths=[256, 512, 1024, 2048])
encoder(image).shape  # torch.Size([1, 2048, 3, 3])
```
To classify images with the ConvNext features, we need to apply a classification head on top of the encoder. We also add a LayerNorm before the final linear layer.
```python
class ClassificationHead(nn.Sequential):
    def __init__(self, num_channels: int, num_classes: int = 1000):
        super().__init__(
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(1),
            nn.LayerNorm(num_channels),
            nn.Linear(num_channels, num_classes),
        )


class ConvNextForImageClassification(nn.Sequential):
    def __init__(
        self,
        in_channels: int,
        stem_features: int,
        depths: List[int],
        widths: List[int],
        drop_p: float = 0.0,
        num_classes: int = 1000,
    ):
        super().__init__()
        self.encoder = ConvNextEncoder(in_channels, stem_features, depths, widths, drop_p)
        self.head = ClassificationHead(widths[-1], num_classes)
```
Final model test:
```python
image = torch.rand(1, 3, 224, 224)
classifier = ConvNextForImageClassification(in_channels=3, stem_features=64, depths=[3, 4, 6, 4], widths=[256, 512, 1024, 2048])
classifier(image).shape  # torch.Size([1, 1000])
```
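As a last sanity check (our own addition; we haven't verified the count against any config in the paper), you can inspect the model size:

```python
# our own addition: total parameter count in millions
sum(p.numel() for p in classifier.parameters()) / 1e6
```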
Summary

In this article we reproduced the whole process the author used to turn a ResNet into ConvNext. If you want the complete code, you can find it at this address:
https://avoid.overfit.cn/post/1fd17e7520134996b532ecd50de9672f
Author: Francesco Zuppichini