Journal of Frontiers of Computer Science and Technology (计算机科学与探索)

• Academic Research •

Self-Training Semi-Supervised Learning Algorithm for New Class Detection

HE Yulin, CHEN Jiaqi, HUANG Qihang, Philippe Fournier-Viger, HUANG Zhexue

  1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
  2. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China

Abstract: Traditional semi-supervised learning (SSL) algorithms suffer from a limited range of application and insufficient generalization capability. In particular, their performance degrades severely when the training data set contains samples from new classes whose labels have not been seen. Obtaining labeled samples through manual annotation requires domain experts, which is costly in both time and money, and mislabeled samples cannot be avoided because the experts' background knowledge is limited. SSL algorithms that aim to label samples with unseen labels correctly are therefore urgently needed in practice. After a detailed analysis of the self-training algorithm, this paper proposes a new class detection SSL (NCD-SSL) algorithm. First, based on the classical extreme learning machine, a universal incremental extreme learning machine is constructed that handles both class-incremental and sample-incremental learning. Second, the self-training algorithm is improved: samples labeled with high confidence are used for sample-incremental learning, while a buffer pool stores the samples labeled with low confidence. Third, clustering and distribution consistency judgement are applied to the samples in the buffer pool to detect new classes, which in turn enables class-incremental learning. Finally, experiments on simulated and real data sets validate the feasibility and effectiveness of NCD-SSL. The results show that when 3, 2, and 1 classes are missing, the testing accuracy of NCD-SSL is generally about 30%, 20%, and 10% higher, respectively, than that of six other SSL algorithms, which confirms that the proposed algorithm achieves better SSL performance in the presence of new classes.

Key words: semi-supervised learning, new class detection, self-training, extreme learning machine, maximum mean discrepancy, distribution consistency
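
The keywords list maximum mean discrepancy (MMD) alongside distribution consistency. Assuming MMD is the statistic behind the distribution consistency judgement, the check on a cluster taken from the buffer pool might be sketched as follows. This is an illustration, not the authors' implementation: the Gaussian kernel, the values of `gamma` and `threshold`, and the names `mmd_squared`, `is_candidate_new_class`, and `gaussian_blob` are all assumptions introduced here.

```python
import math
import random


def rbf_kernel_mean(xs, ys, gamma):
    """Average Gaussian kernel value exp(-gamma * ||x - y||^2) over all pairs."""
    total = 0.0
    for x in xs:
        for y in ys:
            sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
            total += math.exp(-gamma * sq_dist)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return (
        rbf_kernel_mean(xs, xs, gamma)
        + rbf_kernel_mean(ys, ys, gamma)
        - 2.0 * rbf_kernel_mean(xs, ys, gamma)
    )


def is_candidate_new_class(cluster, known_classes, threshold=0.1, gamma=1.0):
    """True when the cluster's MMD^2 to every known class exceeds the threshold.

    `threshold` and `gamma` are illustrative assumptions, not values from the paper.
    """
    return all(
        mmd_squared(cluster, samples, gamma) > threshold
        for samples in known_classes.values()
    )


def gaussian_blob(rng, center, n):
    """Draw n points from a unit-variance Gaussian around `center` (toy data)."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Two known classes and two buffered clusters, all synthetic.
    known = {
        0: gaussian_blob(rng, (0.0, 0.0), 100),
        1: gaussian_blob(rng, (5.0, 5.0), 100),
    }
    familiar = gaussian_blob(rng, (0.0, 0.0), 100)
    novel = gaussian_blob(rng, (-5.0, 5.0), 100)
    print("familiar cluster flagged:", is_candidate_new_class(familiar, known))
    print("novel cluster flagged:", is_candidate_new_class(novel, known))
```

A cluster is flagged only when it is inconsistent with every known class, so a cluster drawn from an existing class that was merely labeled with low confidence stays unflagged. In practice the threshold would need to be calibrated, for example with a permutation test, rather than fixed by hand.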