Journal of Frontiers of Computer Science and Technology ›› 2020, Vol. 14 ›› Issue (11): 1879-1887. DOI: 10.3778/j.issn.1673-9418.1912073

• Artificial Intelligence •


Maximize AUC for Positive-Unlabeled Classification and Incremental Algorithm

MA Yumin, WANG Shitong   

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online:2020-11-01 Published:2020-11-09


Abstract:

Positive-unlabeled (PU) classification deals with training data that contain only positive and unlabeled samples, a setting in which traditional classification methods are often ineffective. Exploiting the relationship between the AUC (area under the receiver operating characteristic curve) under PU classification and under traditional classification, this paper proposes to use the traditional AUC as the objective function for PU classification. A Gaussian kernel function maps the original samples into a high-dimensional space so that the data become linearly separable. Optimizing the AUC objective yields an analytical solution, which avoids repeated iterations, and an incremental update formula can be derived from it to further speed up computation. Experimental results show that the proposed algorithm achieves performance close to that of an ideal support vector machine (SVM) trained with all positive and negative labels known, while supporting fast incremental updates, making it a practical tool for real-world problems.
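The closed-form recipe the abstract describes (kernelize, optimize an AUC-style objective, read off an analytical solution) can be sketched as below. This is a minimal illustration, not the authors' actual formulation: the squared pairwise surrogate for AUC, the ridge regularizer `lam`, the kernel width `gamma`, and all function names are assumptions introduced here for the sketch.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.5):
    """RBF kernel matrix between row sets A and B (assumed kernel form)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_pu_auc(X, pos_idx, unl_idx, gamma=0.5, lam=0.1):
    """Fit f(x) = sum_i alpha_i k(x_i, x) by minimizing a squared pairwise
    surrogate of AUC over all (labeled-positive, unlabeled) pairs:
        sum_{p,u} (1 - (f(x_p) - f(x_u)))^2 + lam * ||alpha||^2.
    This is a linear least-squares problem, so the solution is analytical
    (no iterative optimizer needed)."""
    K = gaussian_kernel(X, X, gamma)
    # One design row per (positive, unlabeled) pair: K[p, :] - K[u, :]
    D = (K[pos_idx][:, None, :] - K[unl_idx][None, :, :]).reshape(-1, len(X))
    # Normal equations: D'D and D'1 are sums over pair rows, so new samples
    # can be absorbed by adding their rows' contributions incrementally.
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(len(X)),
                            D.T @ np.ones(len(D)))
    return alpha

def pu_score(X_train, alpha, X_new, gamma=0.5):
    """Decision scores for new points; higher means more likely positive."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha
```

Because the normal-equation terms are sums over pair rows, an incremental variant in the spirit of the paper would update the accumulated matrix and vector when new samples arrive instead of refitting from scratch.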

Key words: machine learning, positive-unlabeled (PU) classification, AUC, incremental algorithm