Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (9): 2078-2088. DOI: 10.3778/j.issn.1673-9418.2102018

• Artificial Intelligence •

  • Corresponding author: + E-mail: wjkweb@163.com

Semi-supervised Self-Training Algorithm for Density Peak Membership Optimization

LIU Xuewen1, WANG Jikui1,+, YANG Zhengguo1, LI Bing1, NIE Feiping2

  1. School of Information Engineering, Lanzhou University of Finance and Economics, Lanzhou 730020, China
    2. Center for Optical Imagery Analysis and Learning, Northwestern Polytechnical University, Xi'an 710072, China
  • Received: 2021-02-03 Revised: 2021-04-06 Online: 2022-09-01 Published: 2021-04-19
  • About author: LIU Xuewen, born in 1996, M.S. candidate. His research interests include machine learning, artificial intelligence and big data applications.
    WANG Jikui, born in 1978, Ph.D., associate professor, member of CCF. His research interests include machine learning, artificial intelligence and big data applications.
    YANG Zhengguo, born in 1987, Ph.D., associate professor. His research interests include machine learning, artificial intelligence and big data applications.
    LI Bing, born in 1997, M.S. candidate. Her research interests include machine learning, artificial intelligence and big data applications.
    NIE Feiping, born in 1977, Ph.D., professor, Ph.D. supervisor, senior member of CCF. His research interests include machine learning, pattern recognition, data mining, computer vision, artificial intelligence and image processing.
  • Supported by:
    National Natural Science Foundation of China (61772427, 11801345); Innovation Ability Promotion Program of Gansu Provincial Institutions of Higher Learning (2019B-97, 2019A-069); Research Program of Lanzhou University of Finance and Economics (Lzufe2020B-0010, Lzufe2020B-011); Science and Technology Planning Program of Gansu Province (20CX9ZA057)


Abstract:

In reality, most data carry only a few labels because obtaining labels is costly. Compared with supervised and unsupervised learning, semi-supervised learning can achieve higher learning performance at lower labeling cost by making full use of the large amount of unlabeled data and the small amount of labeled data in a dataset. Self-Training is a classical semi-supervised learning algorithm. While iteratively optimizing the classifier, it repeatedly selects high-confidence samples from the unlabeled samples, labels them with the base classifier, and then adds these samples and their pseudo-labels to the training set. Selecting high-confidence samples is the key step of the Self-Training algorithm. Inspired by the density peaks clustering (DPC) algorithm, this paper proposes a semi-supervised Self-Training algorithm for density peak membership optimization (STDPM), which uses density peaks to select high-confidence samples. Firstly, STDPM uses density peaks to discover the potential spatial structure information of the samples and constructs a prototype tree. Secondly, STDPM searches the prototype tree for the unlabeled direct relatives of the labeled samples, and defines the density peaks of the unlabeled direct relatives belonging to different clusters as the clusters-peak; after normalization, the clusters-peak becomes the density peak membership. Finally, STDPM treats samples whose membership exceeds a set threshold as high-confidence samples, which are labeled by the base classifier and added to the training set. STDPM makes full use of the density and distance information implied by the density peaks, which improves the selection quality of high-confidence samples and in turn the classification performance. Comparative experiments on 8 benchmark datasets verify the effectiveness of STDPM.
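The core ingredients described above, DPC-style density and distance statistics, a prototype tree linking each point to its nearest denser neighbor, and membership-based selection of unlabeled relatives of labeled samples, can be sketched as follows. This is only an illustrative reconstruction from the abstract, not the paper's exact method: the Gaussian-kernel density, the cutoff `dc`, the peak definition `gamma = rho * delta`, and the function names are all assumptions.

```python
# Illustrative sketch of DPC-based high-confidence sample selection,
# reconstructed from the abstract; details are assumptions, not STDPM itself.
import numpy as np

def density_peaks(X, dc=1.0):
    """Local density rho (Gaussian kernel), distance delta to the nearest
    denser point, and the parent index that forms the prototype tree."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0    # exclude self-term
    n = len(X)
    delta = np.zeros(n)
    parent = np.full(n, -1)                           # -1 marks the root
    order = np.argsort(-rho)                          # densest first
    for rank, i in enumerate(order):
        if rank == 0:
            continue                                  # root handled below
        denser = order[:rank]                         # all denser points
        j = denser[np.argmin(d[i, denser])]           # nearest denser point
        delta[i] = d[i, j]
        parent[i] = j
    delta[order[0]] = d[order[0]].max()               # usual DPC convention
    return rho, delta, parent

def select_high_confidence(rho, delta, parent, labeled_idx, unlabeled_idx, tau=0.5):
    """Normalize the peak gamma = rho * delta over the unlabeled neighbors
    of labeled nodes in the prototype tree; keep those above threshold tau."""
    gamma = rho * delta
    labeled = set(labeled_idx)
    # "direct relatives": unlabeled nodes adjacent to a labeled node in the tree
    relatives = [i for i in unlabeled_idx
                 if parent[i] in labeled
                 or any(parent[l] == i for l in labeled_idx)]
    if not relatives:
        return []
    g = gamma[relatives]
    membership = (g - g.min()) / (g.max() - g.min() + 1e-12)  # normalize to [0, 1]
    return [i for i, m in zip(relatives, membership) if m >= tau]
```

In a full self-training loop, the selected samples would be labeled by the base classifier, appended to the training set, and the process repeated until no unlabeled sample clears the threshold.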

Key words: density peak membership, clusters-peak, prototype tree, direct relative node sets, self-training
