On Artificial Intelligence: Long-Tailed Distributions and Decoupling Representation and Classifier

Original document: https://www.yuque.com/lart/pa…

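A paper from ICLR 2020.

It proposes a simple and effective strategy, within the re-sampling paradigm, for classification under long-tailed distributions.

The proposed method splits the model's learning process into two parts: _representation learning_ and _classification_.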

For the former, the full model is trained on the original data distribution, i.e. with instance-balanced (natural) sampling, in order to learn _the best and most generalizable representations_. Once that training is done, the classifier is adjusted separately (_retraining the classifier with class-balanced sampling or by a simple, yet effective, classifier weight normalization which has only a single hyperparameter controlling the “temperature” and which does not require additional training_).

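In this work the authors show that, in the long-tailed setting, this separation is a more direct route to strong recognition performance, with no need to design sampling strategies, balanced losses, or additional memory modules.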

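As summarized at https://zhuanlan.zhihu.com/p/158638078:

Any re-balancing of an imbalanced classification dataset should in essence re-balance only the classifier. The class distribution should not be used to alter the distribution of image features during feature learning. Put differently, the distribution of image features and the distribution of class labels are essentially decoupled.

The paper's treatment of related work is careful and comprehensive, and gives a fairly complete overview.
Existing research falls mainly into three directions: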

Data distribution re-balancing. Re-sample the dataset to achieve a more balanced data distribution.
- Over-sampling the minority classes (by adding copies of data)
- Under-sampling the majority classes (by removing data)
- Class-balanced sampling based on the number of samples for each class
Class-balanced Losses. Assign different losses to different training samples for each class.
- The loss can vary at class-level for matching a given data distribution and improving the generalization of tail classes.
- A more fine-grained control of the loss can also be achieved at sample level, e.g. with Focal loss, Meta-Weight-Net, re-weighted training, or based on Bayesian uncertainty.
- To balance the classification regions of head and tail classes using an affinity measure to enforce cluster centers of classes to be uniformly spaced and equidistant.
Transfer learning from head to tail classes. Transferring features learned from head classes with abundant training instances to under-represented tail classes.
- Recent work includes transferring the intra-class variance and transferring semantic deep features. However it is usually a non-trivial task to design specific modules (e.g. external memory) for feature transfer.

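The paper also adds a comparison with a recent benchmark for low-shot recognition:

Low-shot recognition includes a representation learning phase that has no access to the low-shot classes, followed by a low-shot learning phase.
In contrast, the long-tailed recognition setting assumes access to both head and tail classes, and the number of labels per class decreases more continuously.

In long-tailed recognition, the training set as a whole follows a long-tailed distribution over the classes. Some rare classes have very little training data, and a model trained on such an imbalanced dataset tends to under-fit the few-shot classes. In practice, however, what we want is a model that recognizes all classes well. Various re-sampling strategies, loss re-weighting schemes, and margin regularization methods targeting the few-shot classes have therefore been proposed. It remains unclear, however, how they achieve their performance gains on long-tailed recognition (if any).
This paper systematically examines their effectiveness by separating representation learning from classifier learning, in order to identify what really matters for long-tailed recognition.
First, the notation: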

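$X = \{x_i, y_i\}, i \in \{1, \dots, n\}$ denotes the training set, where $y_i$ is the label of data point $x_i$.
$n_j$ denotes the number of training samples of class $j$, and $n = \sum_{j=1}^{C} n_j$ is the total number of training samples.
Without loss of generality, the classes are sorted in decreasing order of their number of samples (their cardinality), so that $n_i \ge n_j$ whenever $i < j$. Because of the long-tailed setting, $n_1 \gg n_C$: the head classes are far larger than the tail classes.
$f(x; \theta) = z$ denotes the representation of an input $x$, where $f(x; \theta)$ is implemented by a CNN with parameters $\theta$.
The final class prediction $\tilde{y}$ is given by a classifier function $g$, i.e. $\tilde{y} = \arg\max g(z)$. Usually $g$ is a linear classifier, $g(z) = \mathbf{W}^\top z + \mathbf{b}$, where $\mathbf{W}$ and $\mathbf{b}$ are the weight matrix and the bias. The paper also discusses some other forms of $g$.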

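Sampling strategies aim to balance the data distribution seen during representation learning and classifier learning. Most of them can be written in one unified form: the probability $p_j$ that a sampled data point comes from class $j$ is $p_j = \frac{n_j^q}{\sum_{i=1}^{C} n_i^q}$. Note that this is expressed per class. For an individual data point, sampling can be viewed as a two-stage process: first sample one of the $C$ classes according to $p_j$, then sample uniformly within that class. The parameter $q \in [0, 1]$ modulates the per-class sampling probabilities, and different values give the following cases: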

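Instance-balanced sampling: the most common way of sampling data, in which every training example is selected with equal probability. Here $q=1$, and the probability $p_j^{IB}$ of sampling a data point from a given class is proportional to that class's cardinality.
Class-balanced sampling: on imbalanced datasets, instance-balanced sampling is suboptimal because the model under-fits the few-shot classes, which lowers accuracy, especially on balanced test sets. Class-balanced sampling has been used to alleviate this discrepancy. Every class is selected with equal probability: $q = 0$, which removes the effect of class size entirely, so $p_j^{CB} = 1/C$ for all classes. In practice the strategy can be seen as a two-stage sampling process: a class is first sampled uniformly from the set of classes, and then an instance is sampled uniformly from within that class.
Square-root sampling: other sampling strategies have also been explored. A widely used variant is square-root sampling, where $q=1/2$.
- Typically, a class-balanced loss assigns sample weights inversely proportionally to the class frequency. This simple heuristic method has been widely adopted. However, recent work on training from large-scale, real-world, long-tailed datasets reveals poor performance when using this strategy. Instead, they use a “smoothed” version of weights that are empirically set to be inversely proportional to the square root of class frequency. (from _Class-Balanced Loss Based on Effective Number of Samples_)
Progressively-balanced sampling: some recent methods combine the strategies above into mixed sampling schemes. In practice, instance-balanced sampling is used for a number of epochs, followed by class-balanced sampling for the remaining epochs. Such mixed schemes require choosing when to switch, which introduces an additional hyperparameter. This paper uses a "soft" version, progressively-balanced sampling, which linearly interpolates between the IB and CB class sampling probabilities with a weight that changes as training progresses: $p^{PB}_j(t) = (1 - \frac{t}{T}) p_j^{IB} + \frac{t}{T} p_j^{CB}$, where $t$ is the current epoch and $T$ is the total number of epochs.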
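The sampling strategies above differ only in the exponent $q$, so they can be sketched with one small helper. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the toy class sizes, and the `indices_by_class` structure are assumptions made for the example.

```python
import random


def class_sampling_probs(class_counts, q):
    """Return p_j = n_j^q / sum_i n_i^q for every class j (q in [0, 1])."""
    powered = [n ** q for n in class_counts]
    total = sum(powered)
    return [p / total for p in powered]


def progressive_probs(class_counts, epoch, total_epochs):
    """Interpolate linearly from instance-balanced (q=1) to class-balanced (q=0)."""
    p_ib = class_sampling_probs(class_counts, 1.0)
    p_cb = class_sampling_probs(class_counts, 0.0)
    alpha = epoch / total_epochs  # t / T in the formula above
    return [(1 - alpha) * ib + alpha * cb for ib, cb in zip(p_ib, p_cb)]


def sample_index(indices_by_class, probs, rng=random):
    """Two-stage sampling: draw a class from probs, then a uniform example in it."""
    j = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return rng.choice(indices_by_class[j])


counts = [1000, 100, 10]  # hypothetical toy class sizes, head to tail
print("instance-balanced:", class_sampling_probs(counts, 1.0))
print("square-root:      ", class_sampling_probs(counts, 0.5))
print("class-balanced:   ", class_sampling_probs(counts, 0.0))
print("progressive, t=45:", progressive_probs(counts, epoch=45, total_epochs=90))
```

With $q=1$ the head class dominates the draws, with $q=0$ every class is equally likely, and the progressive variant moves from the first to the second as `epoch` approaches `total_epochs`.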

The authors plot the resulting sampling weights on ImageNet-LT:

The following passage from _Class-Balanced Loss Based on Effective Number of Samples_ explains the problems with re-sampling well:

In the context of deep feature representation learning using CNNs, re-sampling may either introduce large amounts of duplicated samples, which slows down the training and makes the model susceptible to overfitting when oversampling, or discard valuable examples that are important for feature learning when under-sampling.

Loss re-weighting is not closely related to the paper's main discussion, so the authors do not cover it in much detail.

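In addition, we found that some of the recent methods reporting high performance are hard to train and reproduce, and in many cases require extensive, dataset-specific hyperparameter tuning.

The paper's experiments show that a baseline equipped with a properly balanced classifier performs at least as well as, if not better than, the latest loss re-weighting methods.

Some of the recent related methods compared in the paper: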

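Focal Loss: proposed for object detection. It balances the per-sample classification loss by down-weighting easy samples. For the predicted probability $h_i$ of sample $x_i$ with label $y_i$, it adds a re-weighting factor $(1 - h_i)^{\gamma}$, $\gamma > 0$, to the standard cross-entropy loss: $\mathcal{L}_{\text{focal}} := (1 - h_i)^\gamma \mathcal{L}_{\text{CE}} = -(1 - h_i)^\gamma \log(h_i)$. The overall effect is to put less weight on easy samples with high predicted probability and more weight on hard samples with low predicted probability.
Class-balanced variant of Focal Loss: samples from class $j$ are weighted by a class-balanced coefficient, which can replace the alpha parameter of the original Focal Loss. The method (explained at https://www.cnblogs.com/wanghui-garcia/p/12193562.html) can therefore be seen as a way of setting alpha in focal loss explicitly, based on the concept of the effective number of samples. (_Class-Balanced Loss Based on Effective Number of Samples_: https://github.com/richardaecn/class-balanced-loss)
Label-distribution-aware margin (LDAM) loss (https://arxiv.org/pdf/1906.07413.pdf): encourages larger margins for the few-shot classes. The final loss can be written as a cross-entropy loss with enforced margins: $\mathcal{L}_{\text{LDAM}} := -\log\frac{e^{\hat{y}_j - \Delta_j}}{e^{\hat{y}_j - \Delta_j} + \sum_{c \ne j} e^{\hat{y}_c}}$, where $\hat{y}$ are the logits and $\Delta_j \propto \frac{1}{n_j^{1/4}}$ is the class-aware margin. (For an introduction to margins in softmax losses, see _Understanding Softmax: margin_, a Zhihu article by 王峰: https://zhuanlan.zhihu.com/p/52108088)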
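To make the two loss formulas above concrete, here is a minimal single-sample sketch in plain Python. It is not taken from any of the referenced repositories: the default `gamma=2.0`, the `max_margin` rescaling used to fix the proportionality constant of $\Delta_j$, and all function names are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy for one sample: -log(h), h = softmax(logits)[target]."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - h)^gamma (gamma value is illustrative)."""
    h = softmax(logits)[target]
    return -((1.0 - h) ** gamma) * math.log(h)


def ldam_margins(class_counts, max_margin=0.5):
    """Margins proportional to n_j^(-1/4); rescaled so the largest equals max_margin
    (the rescaling is an assumption used to pick the proportionality constant)."""
    raw = [1.0 / n ** 0.25 for n in class_counts]
    scale = max_margin / max(raw)
    return [r * scale for r in raw]


def ldam_loss(logits, target, margins):
    """Cross-entropy after subtracting the class margin from the target logit only."""
    shifted = list(logits)
    shifted[target] -= margins[target]
    return cross_entropy(shifted, target)


logits = [2.0, 1.0, 0.1]
margins = ldam_margins([1000, 100, 10])  # hypothetical class sizes
print("CE:   ", cross_entropy(logits, 2))
print("focal:", focal_loss(logits, 2))
print("LDAM: ", ldam_loss(logits, 2, margins))
```

Setting `gamma=0.0` recovers plain cross-entropy, and the rarest class receives the largest margin, which is the behavior the two formulas describe.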

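When learning a classification model on a balanced dataset, the classifier is trained jointly with the model that extracts the representation, using a cross-entropy loss. This is also the typical baseline setup for long-tailed recognition. Although various methods such as re-sampling, re-weighting, or transferring representations have been proposed, the general paradigm stays the same: the classifier is either learned jointly with the representation end-to-end, or via a two-stage approach in which the classifier and the representation are jointly fine-tuned with variants of class-balanced sampling in the second stage.

This paper separates representation learning from classification in order to tackle long-tailed recognition.

The following are the classifier learning methods used in the paper. They aim to correct the decision boundaries for head and tail classes, mainly by fine-tuning the classifier with different sampling strategies or by using other non-parametric methods (such as the nearest class mean classifier). Methods that re-balance the classifier weights without any additional retraining are also considered, and they show good accuracy.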

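Classifier Re-training (cRT). A straightforward approach: retrain the classifier with class-balanced sampling. That is, keep the representation model fixed, randomly re-initialize the classifier weights and bias, and optimize them for a small number of epochs using class-balanced sampling.
Nearest Class Mean classifier (NCM). Another commonly used approach is to first compute the mean feature representation of each class on the training set, and then perform nearest-neighbor search, using either cosine similarity or Euclidean distance computed on the L2-normalized mean features. Although this setup is simple, it is a strong baseline. In the paper's experiments, cosine similarity alleviates the weight imbalance problem through its inherent normalization.
$\tau$-normalized classifier ($\tau$-normalized).
- This is an efficient way to re-balance the classifier's decision boundaries, inspired by an empirical observation: after joint training with instance-balanced sampling, the weight norms $\|w_j\|$ are correlated with the class cardinalities. After fine-tuning the classifier with class-balanced sampling, however, the norms of the classifier weights tend to be more similar (as the left side of Figure 2 shows, the weight norms of the model fine-tuned with class-balanced sampling are indeed much flatter).
- Motivated by this observation, the authors rectify the imbalance of the decision boundaries by adjusting the classifier weight norms directly through $\tau$-normalization. Let $\mathbf{W} = \{w_j\} \in \mathbb{R}^{d \times C}$, where $w_j \in \mathbb{R}^d$ are the classifier weights for class $j$. The weights are scaled to obtain the normalized form $\tilde{\mathbf{W}} = \{\tilde{w}_j\}$, $\tilde{w}_i = \frac{w_i}{\|w_i\|^{\tau}}$, where $\tau$ is a hyperparameter controlling the "temperature" of the normalization and $\|\cdot\|$ is the L2 norm. When $\tau = 1$ this reduces to standard L2 normalization, and when $\tau = 0$ no scaling is applied. $\tau \in (0, 1)$ is chosen empirically so that the weights can be rectified smoothly.
- After this normalization, the classification logits are given by $\hat{y} = \tilde{\mathbf{W}}^\top f(x; \theta)$, i.e. the normalized linear classifier is applied to the extracted representation $f(x; \theta)$. Note that the bias term is dropped, because its effect on the logits and on the final prediction is negligible.
- The parameter tau is chosen by grid search on a validation set: _In our submission, tau is determined by grid search on a validation dataset. The search grid is [0.0, 0.1, 0.2, …, 1.0]. We use overall top-1 accuracy to find the best tau on validation set and use that value for test set._
Learnable weight scaling (LWS). Another way to interpret $\tau$-normalization is as a re-scaling of the magnitude of each classifier weight vector that keeps its direction unchanged. This can be reformulated as $\tilde{w}_i = f_i \cdot w_i$, $f_i = \frac{1}{\|w_i\|^\tau}$. Although the hyperparameter of $\tau$-normalization can be chosen by cross-validation, the authors go further and learn the scaling factors $f_i$ on the training set, using class-balanced sampling. In this case both the representation and the classifier weights are kept fixed, and only the scaling factors are learned.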
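The re-scaling shared by $\tau$-normalization and LWS can be sketched in a few lines of plain Python. This is an illustrative sketch under assumptions, not the released code: the classifier is stored as a list of per-class weight vectors, every vector is assumed to have non-zero norm, and the toy weights and feature below are made up to show a head-biased prediction being corrected.

```python
import math


def l2_norm(w):
    """Euclidean norm of one weight vector."""
    return math.sqrt(sum(v * v for v in w))


def tau_normalize(class_weights, tau):
    """Scale each class vector w_i by 1 / ||w_i||^tau (assumes non-zero norms)."""
    out = []
    for w in class_weights:
        scale = 1.0 / l2_norm(w) ** tau
        out.append([scale * v for v in w])
    return out


def scale_weights(class_weights, factors):
    """LWS view: multiply each w_i by a per-class factor f_i (learned in the paper,
    simply passed in here)."""
    return [[f * v for v in w] for w, f in zip(class_weights, factors)]


def logits(class_weights, z):
    """Bias-free linear classifier: one dot product per class."""
    return [sum(wi * zi for wi, zi in zip(w, z)) for w in class_weights]


# Hypothetical 2-class classifier: class 0 (head) has a much larger weight norm.
W = [[3.0, 4.0], [0.8, 0.6]]
z = [1.0, 0.5]  # a made-up feature vector that points toward class 1

for tau in (0.0, 0.5, 1.0):
    scores = logits(tau_normalize(W, tau), z)
    print(f"tau={tau}: logits={scores}, prediction={scores.index(max(scores))}")
```

In this toy case the raw classifier (`tau=0.0`) predicts the head class purely because of its larger norm, while full normalization (`tau=1.0`) lets the direction of the feature decide.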

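From the appendix

Note that among the second-stage classifier adjustment strategies above, only cRT and LWS involve retraining and a sampling strategy, and both use class-balanced re-sampling. NCM and $\tau$-normalized need no second-stage re-sampling strategy, because they require no retraining.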
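Because NCM needs no retraining, it is easy to sketch end to end. The following plain-Python example assumes features have already been extracted by the fixed backbone, that every class has at least one training feature, and that no class mean is the zero vector. The function names and toy features are illustrative, and only the cosine-similarity variant is shown.

```python
import math


def l2_normalize(v):
    """Return v scaled to unit L2 norm (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def class_means(features, labels, num_classes):
    """Mean training feature per class, L2-normalized afterwards."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for z, y in zip(features, labels):
        counts[y] += 1
        for k in range(dim):
            sums[y][k] += z[k]
    return [l2_normalize([s / c for s in row]) for row, c in zip(sums, counts)]


def ncm_predict(z, means):
    """Predict the class whose normalized mean has the highest cosine similarity to z."""
    z = l2_normalize(z)
    sims = [sum(a * b for a, b in zip(z, m)) for m in means]
    return max(range(len(means)), key=sims.__getitem__)


# Hypothetical 2-D features: two samples of class 0, one sample of class 1.
train_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 2.0]]
train_labels = [0, 0, 1]
means = class_means(train_features, train_labels, num_classes=2)
print(ncm_predict([5.0, 1.0], means), ncm_predict([0.1, 0.3], means))
```

Since both the class means and the query are normalized, the number of training samples per class affects only how well each mean is estimated, not its magnitude.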

Places-LT and ImageNet-LT are artificially truncated from their balanced versions (Places-2 (Zhou et al., 2017) and ImageNet-2012 (Deng et al., 2009)) so that the labels of the training set follow a long-tailed distribution.
- Places-LT contains images from 365 categories and the number of images per class ranges from 4980 to 5.
- ImageNet-LT has 1000 classes and the number of images per class ranges from 1280 to 5 images.
iNaturalist 2018 is a real-world, naturally long-tailed dataset, consisting of samples from 8,142 species.

After training on the long-tailed datasets, we evaluate the models on the corresponding balanced test/validation datasets and report the commonly used top-1 accuracy over all classes, denoted as All.
To better examine performance variations across classes with different number of examples seen during training, we follow Liu et al. (2019) and further report accuracy on three splits of the set of classes: Many-shot (more than 100 images), Medium-shot (20∼100 images) and Few-shot (less than 20 images). Accuracy is reported as a percentage.
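The evaluation protocol above can be sketched as follows. This is a plain-Python illustration under assumptions: per-split numbers are computed as the mean of per-class accuracies (which coincides with per-sample accuracy when the test set is balanced), and the function names and toy data are hypothetical.

```python
def split_name(n_train):
    """Map a class's training-set size to its split, using the thresholds above."""
    if n_train > 100:
        return "many"
    if n_train >= 20:
        return "medium"
    return "few"


def split_accuracies(preds, labels, train_counts):
    """Top-1 accuracy in percent: overall ('all') and averaged per class within each split."""
    correct, total = {}, {}
    for p, y in zip(preds, labels):
        total[y] = total.get(y, 0) + 1
        correct[y] = correct.get(y, 0) + int(p == y)
    per_class = {c: correct[c] / total[c] for c in total}
    result = {"all": 100.0 * sum(correct.values()) / sum(total.values())}
    for name in ("many", "medium", "few"):
        accs = [a for c, a in per_class.items() if split_name(train_counts[c]) == name]
        if accs:  # skip splits with no test classes
            result[name] = 100.0 * sum(accs) / len(accs)
    return result


# Hypothetical example: three classes with 500, 50 and 5 training images.
train_counts = [500, 50, 5]
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 0, 1, 0, 0, 0]
print(split_accuracies(preds, labels, train_counts))
```

A model biased toward the head class, as in this toy prediction list, scores well on Many-shot and poorly on Few-shot even though its overall accuracy looks moderate.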

We use the PyTorch (Paszke et al., 2017) framework for all experiments.
For Places-LT, we choose ResNet-152 as the backbone network and pretrain it on the full ImageNet-2012 dataset, following Liu et al. (2019).
On ImageNet-LT, we report results with ResNet-{10, 50, 101, 152} (He et al., 2016) and ResNeXt-{50, 101, 152}(32x4d) (Xie et al., 2017) but mainly use ResNeXt-50 for analysis.
Similarly, ResNet-{50, 101, 152} is also used for iNaturalist 2018.
For all experiments, if not specified, we use SGD optimizer with momentum 0.9, batch size 512, cosine learning rate schedule (Loshchilov & Hutter, 2016) gradually decaying from 0.2 to 0 and image resolution 224×224.
In the first representation learning stage, the backbone network is usually trained for 90 epochs. (Instance-balanced sampling is used for representation learning throughout.)
In the second stage, i.e., for retraining a classifier (cRT), we restart the learning rate and train it for 10 epochs while keeping the backbone network fixed.
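The cosine learning rate schedule mentioned above, decaying from 0.2 to 0, can be written as a one-line formula. The sketch below is plain Python and updates once per epoch, which is an assumption, since the text does not say whether the rate changes per epoch or per iteration. The function name is illustrative.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.2, min_lr=0.0):
    """Cosine annealing without restarts: base_lr at step 0, min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Stage 1: 90 epochs of representation learning.
stage1 = [cosine_lr(epoch, 90) for epoch in range(90)]
# Stage 2 (cRT): the schedule is restarted for 10 epochs with the backbone frozen.
stage2 = [cosine_lr(epoch, 10) for epoch in range(10)]
print(stage1[0], stage1[45], stage2[0], stage2[5])
```

Restarting the schedule for the second stage means the classifier again starts from the full base rate, even though the backbone has already converged.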

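Note that the sampling strategies in this figure all refer to those used during representation learning.

Supplementary figures and tables from the appendix
For the joint training scheme (Joint), the linear classifier and backbone for representation learning are jointly trained for 90 epochs using a standard cross-entropy loss and different sampling strategies, i.e., Instance-balanced, Class-balanced, Square-root, and Progressively-balanced.
Comparing the Joint results in Figure 1 and Table 5 shows:

Better sampling strategies give better performance. The results for different sampling strategies under joint training validate the motivation of related work that tries to design better data sampling methods.
Instance-balanced sampling does better on many-shot classes, because the resulting model is heavily biased toward them.

Comparing with Figure 1, we can tell that cRT is the strategy used here to adjust the model in the second stage.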

For the decoupled learning schemes, we present results when learning the classifier in the ways, i.e., re-initialize and re-train (cRT), Nearest Class Mean (NCM) as well as τ-normalized classifier.
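Overall, Figure 1 shows:

In most cases, decoupled training outperforms joint training.
Even the non-parametric NCM strategy does not perform badly. Its overall performance is dragged down mainly by its poor results on many-shot classes.
NCM and $\tau$-normalized, which need neither extra training nor a sampling strategy, are both highly competitive. Their strong performance may come from their ability to adaptively adjust the decision boundaries for many-, medium- and few-shot classes (as shown in Figure 4).
Across all decoupled methods, instance-balanced sampling gives the best results in terms of overall performance and on every class split except many-shot. This is particularly interesting, because it implies that data imbalance might not be an issue for learning high-quality representations. Instance-balanced sampling yields the most generalizable representations.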

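For further comparison, Table 1 lists several settings: fine-tuning the backbone jointly with the linear classifier (B+C and B+C(0.1×lr)), fine-tuning only the last block of the backbone (LB+C), and retraining the classifier with the backbone fixed (C).
Table 1 shows:

Fine-tuning the whole model performs worst.
Keeping the backbone fixed works best (since the task is long-tailed, the focus is on overall performance and on the few-shot classes).
The decoupled training setup is well suited to long-tailed recognition.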

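Figure 2 (left) empirically shows the L2 norms of the weight vectors of all classifiers, together with the training data distribution, with classes sorted in descending order of their number of training instances.
We can observe:

The weight norms of the joint classifier (blue line) are positively correlated with the number of training instances of the corresponding classes.
- More-shot classes tend to learn classifiers with larger magnitudes. As shown in Figure 4, this yields wider classification boundaries in feature space, which gives the classifier higher accuracy on data-rich classes but hurts data-scarce classes.
The τ-normalized classifier (gold line) alleviates this issue to some extent by providing more balanced classifier weight magnitudes.
For the re-training strategy (green line), the weights are almost balanced, except that few-shot classes have slightly larger weight norms.
The NCM approach would give a horizontal line in the figure, because the mean vectors are L2-normalized before the nearest-neighbor search.
Figure 2 (right) further examines how performance changes as the temperature parameter τ of the τ-normalized classifier varies. As τ increases from 0, many-shot accuracy drops sharply while few-shot accuracy rises sharply.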

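Choosing $\tau$: in the current setup tau has to be determined on a validation set, which can be a drawback in practical scenarios. The authors therefore design two more adaptive strategies:
- Finding tau on the training set: as Table 9 shows, the resulting performance on the test set is very close.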
  - We achieve this goal by simulating a balanced testing distribution from the training set.
    - We first feed the whole training set through the network to get the top-1 accuracy for each of the classes.
    - Then, we average the class-specific accuracies and use the averaged accuracy as the metric to determine the tau value.
  - As shown in Table 9, we compare the τ found on training set and validation set for all three datasets. We can see that both the value of τ and the overall performances are very close to each other, which demonstrates the effectiveness of searching for τ on training set.
  - This strategy offers a practical way to find τ even when validation set is not available.
- Learning tau on the training set: We further investigate if we can automatically learn the τ value instead of grid search.
  - To this end, following cRT, we set τ as a learnable parameter and learn it on the training set with balanced sampling, while keeping all the other parameters fixed (including both the backbone network and classifier).
  - Also, we compare the learned τ value and the corresponding results in the Table 9 (denoted by “learn” = ✓). This further reduces the manual effort of searching best τ values and makes the strategy more accessible for practical usage.
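The first strategy, searching for τ on the training set with class-averaged accuracy as the metric, can be sketched as a small grid search. This plain-Python sketch makes several assumptions: features are precomputed by the fixed backbone, the classifier is a list of per-class weight vectors with non-zero norms, the grid is [0.0, 0.1, ..., 1.0], ties go to the smallest τ, and the toy weights and features are invented for illustration.

```python
import math


def tau_normalize(class_weights, tau):
    """Scale each class vector w_i by 1 / ||w_i||^tau (assumes non-zero norms)."""
    out = []
    for w in class_weights:
        norm = math.sqrt(sum(v * v for v in w))
        out.append([v / norm ** tau for v in w])
    return out


def predict(class_weights, z):
    """Arg-max over bias-free linear logits."""
    scores = [sum(a * b for a, b in zip(w, z)) for w in class_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def mean_class_accuracy(class_weights, features, labels):
    """Average of per-class top-1 accuracies, simulating a balanced test distribution."""
    correct, total = {}, {}
    for z, y in zip(features, labels):
        total[y] = total.get(y, 0) + 1
        correct[y] = correct.get(y, 0) + int(predict(class_weights, z) == y)
    return sum(correct[c] / total[c] for c in total) / len(total)


def search_tau(class_weights, features, labels, grid=None):
    """Return (best_tau, best_accuracy) over the grid; ties keep the smallest tau."""
    if grid is None:
        grid = [i / 10 for i in range(11)]
    best_tau, best_acc = None, -1.0
    for tau in grid:
        acc = mean_class_accuracy(tau_normalize(class_weights, tau), features, labels)
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau, best_acc


# Hypothetical imbalanced training features: three of class 0, one of class 1.
W = [[3.0, 4.0], [0.8, 0.6]]
features = [[0.2, 1.0], [0.2, 1.0], [0.2, 1.0], [1.0, 0.5]]
labels = [0, 0, 0, 1]
print(search_tau(W, features, labels))
```

Averaging per-class accuracies rather than per-sample accuracy is what keeps the head classes from dominating the choice of τ on an imbalanced training set.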

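Comparing an MLP classifier with a linear classifier: We use ReLU as activation function, set the batch size to be 512, and train the MLP using balanced sampling on fixed representation for 10 epochs with a cosine learning rate schedule, which gradually decreases the learning rate to zero.

Replacing the linear classifier with a cosine similarity classifier: We tried to replace the linear classifier with a cosine similarity classifier with (denoted by “cos”) and without (denoted by “cos(noRelu)”) the last ReLU activation function, following [Dynamic few-shot visual learning without forgetting].

Although the sampling strategy matters when representations and classifiers are learned jointly, instance-balanced sampling gives more generalizable representations. After the classifier is properly re-balanced, these representations reach state-of-the-art performance without carefully designed losses or memory units.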

Paper: https://arxiv.org/pdf/1910.09217.pdf
Code: https://github.com/facebookresearch/classifier-balancing
Notes on "Decoupling Representation and Classifier", a Zhihu article by 千佛山彭于晏: https://zhuanlan.zhihu.com/p/111518894
Long-Tailed Classification (2): recent research on classification under long-tailed distributions, a Zhihu article by 青磷不可燃: https://zhuanlan.zhihu.com/p/158638078
OpenReview page: https://openreview.net/forum?id=r1gRTCVFvB&noteId=SJx9gIcsoS
A sentence similarity model based on GRU and am-softmax, 科学空间 | Scientific Spaces: https://kexue.fm/archives/5743
