
On Machine Learning: An Introduction to Self-Labeled Techniques

This article gives a brief introduction to self-labeled techniques, mainly covering their definition and taxonomy. The definitions are quoted in the original English wording. The material is drawn from [1].

Definition

First, the definitions for this family of methods are given below.

Semi-supervised learning (SSL):

Combines supervised and unsupervised learning to provide additional information for pattern recognition.
An extension of unsupervised and supervised learning by including additional information typical of the other learning paradigm.

SSL falls into the following two categories:

  • Semi-supervised classification (SS-Cla):

    Focuses on semi-supervised classification problems.

  • Semi-supervised clustering (SS-Clu):

    Focuses on semi-supervised clustering problems.

Self-labeled techniques address SS-Cla, i.e., the classification problem.

Self-Labeled Method:

Self-labeled techniques generally enlarge the labeled dataset (EL) by assigning labels to unlabeled samples.
These techniques aim to obtain one (or several) enlarged labeled set(s) (EL), based on their most confident predictions, to classify unlabeled data.

  • Self-training:

    Train a classifier on the labeled samples and use it to label the unlabeled ones. The most confidently pseudo-labeled samples are then added to the enlarged labeled set EL, and the model is retrained.
    A classifier is trained with an initial small number of labeled examples, aiming to classify unlabeled points. Then it is retrained with its own most confident predictions, enlarging its labeled training set. This model does not make any specific assumptions for the input data, but it accepts that its own predictions tend to be correct.

  • Co-training:

    Train several classifiers; each classifier teaches the others with its own most confidently predicted samples.
    It trains one classifier in each specific view, and then the classifiers teach each other the most confidently predicted examples. Multi-view learning for SSC is usually understood to be a generalization of co-training.
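The self-training loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from [1]: the nearest-centroid classifier, the distance-margin confidence, and all names (`fit_centroids`, `predict_with_confidence`, `self_train`) are choices made here for brevity.

```python
import math

def fit_centroids(X, y):
    # One centroid per class: the mean of that class's samples.
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: tuple(sum(col) / len(pts) for col in zip(*pts))
            for c, pts in groups.items()}

def predict_with_confidence(centroids, x):
    # Label of the nearest centroid; confidence is the distance margin
    # between the second-nearest and the nearest centroid.
    dists = sorted((math.dist(x, c), label) for label, c in centroids.items())
    best_d, best_label = dists[0]
    margin = dists[1][0] - best_d if len(dists) > 1 else float("inf")
    return best_label, margin

def self_train(L_X, L_y, U, threshold=1.0, max_iter=10):
    # EL starts as the labeled set L; each round, the most confident
    # predictions on U are pseudo-labeled, moved into EL, and the
    # classifier is retrained on the enlarged set.
    EL_X, EL_y, U = list(L_X), list(L_y), list(U)
    for _ in range(max_iter):
        model = fit_centroids(EL_X, EL_y)
        keep, added = [], 0
        for x in U:
            label, conf = predict_with_confidence(model, x)
            if conf >= threshold:
                EL_X.append(x)
                EL_y.append(label)
                added += 1
            else:
                keep.append(x)
        U = keep
        if added == 0 or not U:  # stop when nothing confident remains
            break
    return fit_centroids(EL_X, EL_y), EL_X, EL_y
```

Note how the loop "accepts that its own predictions tend to be correct": once a pseudo-label enters EL, it is never revisited.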

Taxonomy

By addition mechanism:

How pseudo-labeled samples are selected and added.

  • Incremental:

    Starting from EL = L, the most confident samples are added step by step.
    Advantage: fast.
    Drawback: samples with incorrect pseudo-labels may be added.

  • Batch:

    An addition rule is specified, and all samples satisfying that criterion join the training set. The difference from Incremental is that Incremental selects the samples the classifier predicts with highest confidence at the current training stage and assigns them definite class labels, whereas Batch does not assign definite labels to unlabeled samples during the training stage.

  • Amending:

    Starting from EL = L, samples can be both added and removed, which provides the ability to correct earlier mistakes.
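To make the Amending mechanism concrete, here is a hedged sketch of one amending pass: previously pseudo-labeled samples are re-checked under the current model and returned to the unlabeled pool if the model no longer confidently agrees with their assigned label. The helper name `amend` and the `predict_conf(model, x) -> (label, confidence)` interface are assumptions made for this sketch, not from [1].

```python
def amend(EL_X, EL_y, n_labeled, model, predict_conf, threshold):
    # Amending pass: the first n_labeled entries are the original labeled
    # set L and are always kept; pseudo-labeled entries are re-evaluated
    # and removed when the current model no longer confidently predicts
    # them with their assigned label.
    keep_X, keep_y = list(EL_X[:n_labeled]), list(EL_y[:n_labeled])
    removed = []
    for x, y in zip(EL_X[n_labeled:], EL_y[n_labeled:]):
        label, conf = predict_conf(model, x)
        if label == y and conf >= threshold:
            keep_X.append(x)
            keep_y.append(y)
        else:
            removed.append(x)  # back to the unlabeled pool
    return keep_X, keep_y, removed
```

Running such a pass between retraining rounds is what distinguishes Amending from the purely additive Incremental scheme.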

By single-learning versus multi-learning:

  • Single-learning: predictions are given by a single classification algorithm / classifier.

  • Multi-learning: predictions are given jointly by multiple classifiers.

By single-view versus multi-view:

A feature representation of the samples that carries sufficient information on its own is called a view.

  • multi-view

  • single-view

By confidence measures:

How confidence is defined.

  • Simple

    Computed from the predicted probability of the sample.

  • Agreement and combination

    Computed by combining the predictions of multiple classifiers, or by using a hybrid model.
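The two families of confidence measures can be illustrated with a short sketch; the function names and the dict/list inputs are assumptions chosen here, not notation from [1].

```python
def simple_confidence(probs):
    # "Simple": confidence is the highest class probability reported by
    # a single probabilistic classifier, e.g. {"cat": 0.7, "dog": 0.3}.
    label = max(probs, key=probs.get)
    return label, probs[label]

def agreement_confidence(votes):
    # "Agreement and combination": confidence is the fraction of
    # ensemble members whose predictions agree on the majority label.
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    label = max(counts, key=counts.get)
    return label, counts[label] / len(votes)
```

Either score can then be compared against a threshold to decide whether a pseudo-labeled sample enters EL.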

By self-teaching versus mutual-teaching:

  • Mutual-teaching: each classifier provides its own EL to the others.

  • Self-teaching: each classifier uses only its own EL.
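One mutual-teaching round between view-specific classifiers might look like the following sketch; the `predict_conf(model, x) -> (label, confidence)` interface and all names are illustrative assumptions, not from [1].

```python
def mutual_teaching_round(models, views_U, predict_conf, threshold):
    # Each classifier i labels the unlabeled pool in its own view; its
    # confident predictions are handed to every *other* classifier's
    # enlarged labeled set (mutual-teaching). Keeping the predictions
    # for classifier i itself instead would be self-teaching.
    new_for = {j: [] for j in range(len(models))}
    for i, model in enumerate(models):
        for idx, x in enumerate(views_U[i]):
            label, conf = predict_conf(model, x)
            if conf >= threshold:
                for j in range(len(models)):
                    if j != i:
                        new_for[j].append((idx, label))
    return new_for
```

This is the exchange step at the heart of co-training: sample `idx` gets its pseudo-label from one view and is learned from in the other.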

By stopping criteria:

  • Label the full set

    Traditional methods pseudo-label all unlabeled samples, but this introduces many incorrectly labeled samples.

  • Label a subset

    Only part of the samples are selected, but the number of selection iterations must be defined in advance and depends on the dataset size.

  • Hypothesis unchanged

    Stop when the selected samples no longer change the hypothesis (the classifier).

[1] Triguero, I., et al. "Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study." Knowledge and Information Systems 42 (2013): 245-284.
