Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (9): 2078-2088. DOI: 10.3778/j.issn.1673-9418.2102018

• Artificial Intelligence •

  • Corresponding author: + E-mail: wjkweb@163.com

Semi-supervised Self-Training Algorithm for Density Peak Membership Optimization

LIU Xuewen1, WANG Jikui1,+, YANG Zhengguo1, LI Bing1, NIE Feiping2

  1. School of Information Engineering, Lanzhou University of Finance and Economics, Lanzhou 730020, China
    2. Center for Optical Imagery Analysis and Learning, Northwestern Polytechnical University, Xi'an 710072, China
  • Received: 2021-02-03 Revised: 2021-04-06 Online: 2022-09-01 Published: 2021-04-19
  • About author: LIU Xuewen, born in 1996, M.S. candidate. His research interests include machine learning, artificial intelligence and big data applications.
    WANG Jikui, born in 1978, Ph.D., associate professor, member of CCF. His research interests include machine learning, artificial intelligence and big data applications.
    YANG Zhengguo, born in 1987, Ph.D., associate professor. His research interests include machine learning, artificial intelligence and big data applications.
    LI Bing, born in 1997, M.S. candidate. Her research interests include machine learning, artificial intelligence and big data applications.
    NIE Feiping, born in 1977, Ph.D., professor, Ph.D. supervisor, senior member of CCF. His research interests include machine learning, pattern recognition, data mining, computer vision, artificial intelligence and image processing.
  • Supported by:
    National Natural Science Foundation of China (61772427, 11801345); Innovation Ability Promotion Program of Gansu Provincial Institutions of Higher Learning (2019B-97, 2019A-069); Research Program of Lanzhou University of Finance and Economics (Lzufe2020B-0010, Lzufe2020B-011); Science and Technology Planning Program of Gansu Province (20CX9ZA057)


Abstract:

In reality, most data carry only a few labels because obtaining labels is costly. Compared with supervised and unsupervised learning, semi-supervised learning can achieve higher learning performance at lower labeling cost by making full use of the large amount of unlabeled data and the small amount of labeled data in a dataset. Self-Training is a classical semi-supervised learning algorithm. While iteratively optimizing the classifier, it repeatedly selects high-confidence samples from the unlabeled samples, labels them with the base classifier, and then adds these samples and their pseudo-labels to the training set. Selecting high-confidence samples is the key step of the Self-Training algorithm. Inspired by the density peaks clustering (DPC) algorithm, this paper proposes a semi-supervised Self-Training algorithm for density peak membership optimization (STDPM), which uses density peaks to select high-confidence samples. Firstly, STDPM uses density peaks to discover the potential spatial structure information of the samples and constructs a prototype tree. Secondly, STDPM searches the prototype tree for the unlabeled direct relatives of the labeled samples, and defines the density peaks of the unlabeled direct relatives belonging to different clusters as the clusters-peak; after normalization, the clusters-peak becomes the density peak membership. Finally, STDPM treats samples whose membership exceeds a set threshold as high-confidence samples, which are labeled by the base classifier and added to the training set. STDPM makes full use of the density and distance information implied by the density peaks, which improves the selection quality of high-confidence samples and in turn the classification performance. Comparative experiments on 8 benchmark datasets verify the effectiveness of STDPM.
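The core ingredients described above, DPC-style density and distance statistics, a prototype tree linking each point to its nearest denser neighbor, and membership-based selection of unlabeled relatives of labeled samples, can be sketched as follows. This is only an illustrative reconstruction from the abstract, not the paper's exact method: the Gaussian-kernel density, the cutoff `dc`, the peak definition `gamma = rho * delta`, and the function names are all assumptions.

```python
# Illustrative sketch of DPC-based high-confidence sample selection,
# reconstructed from the abstract; details are assumptions, not STDPM itself.
import numpy as np

def density_peaks(X, dc=1.0):
    """Local density rho (Gaussian kernel), distance delta to the nearest
    denser point, and the parent index that forms the prototype tree."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0    # exclude self-term
    n = len(X)
    delta = np.zeros(n)
    parent = np.full(n, -1)                           # -1 marks the root
    order = np.argsort(-rho)                          # densest first
    for rank, i in enumerate(order):
        if rank == 0:
            continue                                  # root handled below
        denser = order[:rank]                         # all denser points
        j = denser[np.argmin(d[i, denser])]           # nearest denser point
        delta[i] = d[i, j]
        parent[i] = j
    delta[order[0]] = d[order[0]].max()               # usual DPC convention
    return rho, delta, parent

def select_high_confidence(rho, delta, parent, labeled_idx, unlabeled_idx, tau=0.5):
    """Normalize the peak gamma = rho * delta over the unlabeled neighbors
    of labeled nodes in the prototype tree; keep those above threshold tau."""
    gamma = rho * delta
    labeled = set(labeled_idx)
    # "direct relatives": unlabeled nodes adjacent to a labeled node in the tree
    relatives = [i for i in unlabeled_idx
                 if parent[i] in labeled
                 or any(parent[l] == i for l in labeled_idx)]
    if not relatives:
        return []
    g = gamma[relatives]
    membership = (g - g.min()) / (g.max() - g.min() + 1e-12)  # normalize to [0, 1]
    return [i for i, m in zip(relatives, membership) if m >= tau]
```

In a full self-training loop, the selected samples would be labeled by the base classifier, appended to the training set, and the process repeated until no unlabeled sample clears the threshold.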

Key words: density peak membership, clusters-peak, prototype tree, direct relative node sets, self-training
