Semi-supervised Self-Training Algorithm for Density Peak Membership Optimization

doi:10.3778/j.issn.1673-9418.2102018

Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (9): 2078-2088.DOI: 10.3778/j.issn.1673-9418.2102018

• Artificial Intelligence • Previous Articles Next Articles

Semi-supervised Self-Training Algorithm for Density Peak Membership Optimization

LIU Xuewen¹, WANG Jikui¹^,⁺(), YANG Zhengguo¹, LI Bing¹, NIE Feiping²

1. School of Information Engineering, Lanzhou University of Finance and Economics, Lanzhou 730020, China
2. Center for Optical Imagery Analysis and Learning, Northwestern Polytechnical University, Xi'an 710072, China

Received:2021-02-03 Revised:2021-04-06 Online:2022-09-01 Published:2021-04-19
About author:LIU Xuewen, born in 1996, M.S. candidate. His research interests include machine learning, arti-ficial intelligence and big data applications.
WANG Jikui, born in 1978, Ph.D., associate pro-fessor, member of CCF. His research interests include machine learning, artificial intelligence and big data applications.
YANG Zhengguo, born in 1987, Ph.D., asso-ciate professor. His research interests include ma-chine learning, artificial intelligence and big data applications.
LI Bing, born in 1997, M.S. candidate. Her re-search interests include machine learning, arti-ficial intelligence and big data applications.
NIE Feiping, born in 1977, Ph.D., professor, Ph.D. supervisor, senior member of CCF. His re-search interests include machine learning, pat-tern recognition, data mining, computer vision, artificial intelligence and image processing.
Supported by:
National Natural Science Foundation of China(61772427);National Natural Science Foundation of China(11801345);Innovation Ability Promotion Program of Gansu Provincial Institutions of Higher Learning(2019B-97);Innovation Ability Promotion Program of Gansu Provincial Institutions of Higher Learning(2019A-069);Research Program of Lanzhou University of Finance and Economics(Lzufe2020B-0010);Research Program of Lanzhou University of Finance and Economics(Lzufe2020B-011);Science and Technology Planning Program of Gansu Province(20CX9ZA057)

密度峰值隶属度优化的半监督Self-Training算法

刘学文¹, 王继奎¹^,⁺(), 杨正国¹, 李冰¹, 聂飞平²

1.兰州财经大学信息工程学院,兰州 730020
2.西北工业大学光学影像分析与学习中心,西安 710072

通讯作者: + E-mail: wjkweb@163.com
作者简介:刘学文（1996—）,男,硕士研究生,主要研究方向为机器学习、人工智能、大数据应用。
王继奎（1978—）,男,博士,副教授,CCF会员,主要研究方向为机器学习、人工智能、大数据应用。
杨正国（1987—）,男,博士,副教授,主要研究方向为机器学习、人工智能、大数据应用。
李冰（1997—）,女,硕士研究生,主要研究方向为机器学习、人工智能、大数据应用。
聂飞平（1977—）,男,博士,教授,博士生导师,CCF高级会员,主要研究方向为机器学习、模式识别、数据挖掘、计算机视觉、人工智能、图像处理。
基金资助:
国家自然科学基金(61772427);国家自然科学基金(11801345);甘肃省高等学校创新能力提升项目(2019B-97);甘肃省高等学校创新能力提升项目(2019A-069);兰州财经大学科研项目(Lzufe2020B-0010);兰州财经大学科研项目(Lzufe2020B-011);甘肃省科技计划项目(20CX9ZA057)

Abstract

Abstract:

Most of data contain only a few labels because of high cost of obtaining them in reality. Compared with supervised learning and unsupervised learning, semi-supervised learning can obtain higher learning performance with less labeling cost by making full use of large amount of unlabeled data and small amount of labeled data in datasets.Self-Training algorithm is a classical semi-supervised learning algorithm. In the process of iteratively optimizing classi-fier, high-confidence samples are continuously selected from unlabeled samples and labeled by the base classifier.Then, these samples and pseudo-labels will be added into the training sets. Selecting high-confidence samples is a critical step in the Self-Training algorithm. Inspired by the density peaks clustering (DPC) algorithm, this paper pro-poses semi-supervised Self-Training algorithm for density peak membership optimization (STDPM), which uses den-sity peak to select high-confidence samples. Firstly, STDPM takes density peak to discover the potential spatial structure information of the samples and constructs a prototype tree. Secondly, STDPM searches the unlabeled direct relatives of the labeled samples in the prototype tree, and defines the density peak of the unlabeled direct relatives that belong to different clusters as the clusters-peak. Then, clusters-peak is turned into the density peak membership after normalized. Finally, STDPM regards samples with membership greater than the set threshold as high-confidence samples that are labeled by the base classifier and added to the training set. STDPM makes full use of the density and distance information implied by the peak, which improves the selection quality of high-confi-dence samples and further improves the classification performance. Comparative experiments are conducted on 8 bench-mark datasets, which verify the effectiveness of STDPM.

Key words: density peak membership, clusters-peak, prototype tree, direct relative node sets, self-training

摘要：

现实中由于获取标签的成本很高,大部分的数据只含有少量标签。相比监督学习和无监督学习,半监督学习能充分利用数据集中的大量无标签数据和少量有标签数据,以较少的标签成本获得较高的学习性能。自训练算法是一种经典的半监督学习算法,在其迭代优化分类器的过程中,不断从无标签样本中选取高置信度样本并由基分类器赋予标签,再将这些样本和伪标签添加进训练集。选取高置信度样本是Self-Training算法的关键,受密度峰值聚类算法（DPC）启发,将密度峰值用于高置信度样本的选取,提出了密度峰值隶属度优化的半监督Self-Training算法（STDPM）。首先,STDPM利用密度峰值发现样本的潜在空间结构信息并构造原型树。其次,搜索有标签样本在原型树上的无标签近亲结点,将无标签近亲结点的隶属于不同类簇的峰值定义为簇峰值,归一化后作为密度峰值隶属度。最后,将隶属度大于设定阈值的样本作为高置信度样本,由基分类器赋予标签后添加进训练集。STDPM充分利用密度峰值所隐含的密度和距离信息,提升了高置信度样本的选取质量,进而提升了分类性能。在8个基准数据集上进行对比实验,结果验证了STDPM算法的有效性。

关键词: 密度峰值隶属度, 簇峰值, 原型树, 近亲结点集, 自训练

CLC Number:

TP181

LIU Xuewen, WANG Jikui, YANG Zhengguo, LI Bing, NIE Feiping. Semi-supervised Self-Training Algorithm for Density Peak Membership Optimization[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(9): 2078-2088.

刘学文, 王继奎, 杨正国, 李冰, 聂飞平. 密度峰值隶属度优化的半监督Self-Training算法[J]. 计算机科学与探索, 2022, 16(9): 2078-2088.

Figures/Tables 11

Table 1 Notations in STDPM

符号	说明
$X = {x 1, x 2, ⋯, x n}$	包含 $n$ 个样本的集合
$Y = {y 1, y 2, ⋯, y n}$	包含 $n$ 个样本标签的集合
$L = {(x 1, y 1), (x 2, y 2), ⋯, (x l, y l)}$	包含 $l$ 个样本和样本标签对的集合
$U = {x 1, x 2, ⋯, x u}$	包含 $u$ 个无标签样本的集合
$P i$	样本 $x i$ 的原型
$R i = {x 1, x 2, ⋯, x k}$	包含样本 $x i$ 的 $k$ 个近亲结点的集合
$ξ i j$	样本 $x i$ 属于第 $j$ 个类簇的隶属度
$S = {x 1, x 2, ⋯, x m}$	包含 $m$ 个高置信度样本的集合

Table 1 Notations in STDPM

符号	说明
$X = {x 1, x 2, ⋯, x n}$	包含 $n$ 个样本的集合
$Y = {y 1, y 2, ⋯, y n}$	包含 $n$ 个样本标签的集合
$L = {(x 1, y 1), (x 2, y 2), ⋯, (x l, y l)}$	包含 $l$ 个样本和样本标签对的集合
$U = {x 1, x 2, ⋯, x u}$	包含 $u$ 个无标签样本的集合
$P i$	样本 $x i$ 的原型
$R i = {x 1, x 2, ⋯, x k}$	包含样本 $x i$ 的 $k$ 个近亲结点的集合
$ξ i j$	样本 $x i$ 属于第 $j$ 个类簇的隶属度
$S = {x 1, x 2, ⋯, x m}$	包含 $m$ 个高置信度样本的集合

Table 2 Samples information

标记	$ρ i$	$δ i$	$γ i = ρ i ∙ δ i$	$P i$
1	1.000	1.000	1.000	1
2	0.690	0.747	0.515	12
3	0.513	0.571	0.293	2
4	0.443	0.183	0.081	15
5	0.831	0.028	0.023	1
6	0.301	0.068	0.020	7
7	0.496	0.034	0.017	3
8	0.223	0.059	0.013	5
9	0.160	0.077	0.012	2
10	0.443	0.022	0.010	4
11	0.055	0.110	0.006	3
12	0.829	0.001	0.001	1
13	0.002	0.155	0.000	11
14	0.000	0.186	0.000	12
15	0.639	0.000	0.000	2

Table 2 Samples information

标记	$ρ i$	$δ i$	$γ i = ρ i ∙ δ i$	$P i$
1	1.000	1.000	1.000	1
2	0.690	0.747	0.515	12
3	0.513	0.571	0.293	2
4	0.443	0.183	0.081	15
5	0.831	0.028	0.023	1
6	0.301	0.068	0.020	7
7	0.496	0.034	0.017	3
8	0.223	0.059	0.013	5
9	0.160	0.077	0.012	2
10	0.443	0.022	0.010	4
11	0.055	0.110	0.006	3
12	0.829	0.001	0.001	1
13	0.002	0.155	0.000	11
14	0.000	0.186	0.000	12
15	0.639	0.000	0.000	2

Fig.1 Samples distribution and illustration of prototype tree

Fig.2 Illustration of calculating clusters-peak

Table 3 Parameters setting

算法	参数
STDPM	$α = 2, β = 0.5$
SETRED	$θ = 0.1$
STDP	$α = 2$
STDPCEW	$α = 2, θ = 0.1$
STSFCM	$ε = 0.5$

Table 3 Parameters setting

算法	参数
STDPM	$α = 2, β = 0.5$
SETRED	$θ = 0.1$
STDP	$α = 2$
STDPCEW	$α = 2, θ = 0.1$
STSFCM	$ε = 0.5$

Table 4 Datasets information

数据集	样本数	维度	类簇数
Banknote	1 372	4	2
Breast	699	10	2
Hepatitis	142	13	2
Ionosphere	351	34	2
Palm	2 000	256	100
Segment	2 310	19	7
Yale	165	1 024	15
Zoo	101	16	7

Table 5

指标	算法	Banknote	Breast	Hepatitis	Ionosphere
Accuracy	STDPM	99.57±0.43	58.87±3.05	61.33±9.73	88.59±3.08
	SETRED	99.37±0.56	58.77±1.84	59.23±8.83	86.78±2.75
	STDP	99.10±0.55	57.18±1.21	58.50±4.69	85.59±4.07
	STDPCEW	99.39±0.55	57.22±2.53	53.77±3.72	86.98±4.23
	STSFCM	99.38±0.52	57.05±2.77	56.65±4.28	87.81±1.35
F-score	STDPM	99.35±0.95	58.51±2.70	60.90±4.51	87.50±3.40
	SETRED	99.14±1.24	58.42±2.99	58.46±9.51	85.97±3.93
	STDP	98.96±0.97	57.43±2.74	58.38±7.69	84.58±2.67
	STDPCEW	99.19±0.42	57.51±2.56	52.95±9.96	86.24±2.92
	STSFCM	99.21±0.59	57.21±1.78	55.77±8.12	86.88±2.97

Table 6

指标	算法	Palm	Segment	Yale	Zoo
Accuracy	STDPM	90.52±0.40	93.64±0.69	78.84±3.24	91.29±1.46
	SETRED	88.18±0.60	93.54±0.66	77.63±2.88	87.73±3.07
	STDP	88.16±0.46	93.38±0.58	76.47±2.05	87.93±2.83
	STDPCEW	90.48±0.91	93.47±0.66	77.98±2.32	88.48±2.03
	STSFCM	88.45±0.59	93.23±0.56	78.55±1.81	87.60±3.15
F-score	STDPM	88.11±1.02	93.33±0.76	62.88±7.12	88.09±2.48
	SETRED	84.54±1.25	93.22±0.73	58.92±5.59	81.02±7.25
	STDP	84.59±1.12	93.05±0.64	55.79±7.69	79.89±6.90
	STDPCEW	87.98±1.33	93.14±0.73	57.30±6.65	82.76±5.52
	STSFCM	85.03±1.10	92.89±0.62	61.05±3.18	78.63±6.94

Fig.3 Accuracy on different ratios of labeled datasets

Table 7 Comparison of running time

数据集	STDPM			SETRED			STDP		STDPCEW			STSFCM
数据集	时间/ms	迭代次数		时间/ms		迭代次数	时间/ms	迭代次数	时间/ms	迭代次数		时间/ms		迭代次数
Banknote	398		11	6 148	6		227	4	92 256		30	18	2
Breast	169		9	2 362	21		85	4	13 818		40	15	2
Hepatitis	59		6	1 331	10		39	3	228		7	12	2
Ionosphere	85		8	1 516	15		46	4	989		17	14	2
Palm	1 428		10	32 183	13		559	6	246 625		20	418	6
Segment	1 042		12	62 533	21		726	4	564 969		24	90	7
Yale	65		5	1 661	6		31	4	214		16	38	4
Zoo	37		4	1 477	9		38	4	103		12	19	3

Fig.4 Accuracy on different datasets with different parameters α and β

References 24

[1]	SEN P C, HAJRA M, GHOSH M. Supervised classification algorithms in machine learning: a survey and review[C]// Proceedings of IEM Graph 2018. Cham: Springer, 2020: 99-111.
[2]	ALLOGHANI M, AL-JUMEILY D, MUSTAFINA J, et al. A systematic review on supervised and unsupervised ma-chine learning algorithms for data science[M]// BERRYM W, MOHAMEDA, YAPB W. Supervised and Unsuper-vised Learning for Data Science. Cham: Springer, 2020: 3-21.
[3]	CHONG Y, DING Y, YAN Q, et al. Graph-based semi-super-vised learning: a review[J]. Neurocomputing, 2020, 408: 216-230. DOI URL
[4]	KOSTOPOULOS G, KARLOS S, KOTSIANTIS S, et al. Semi-supervised regression: a recent review[J]. Journal of In-telligent & Fuzzy Systems, 2018, 35(2): 1483-1500.
[5]	韩嵩, 韩秋弘. 半监督学习研究的述评[J]. 计算机工程与应用, 2020, 56(6): 19-27. DOI
	HAN S, HAN Q H. Review of semi-supervised learning re-search[J]. Computer Engineering and Applications, 2020, 56(6): 19-27.
[6]	LI M, ZHOU Z H. SETRED: self-training with editing[C]// LNCS 3518: Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Hanoi, May 18-20, 2005. Berlin, Heidelberg: Springer, 2005: 611-621.
[7]	ZIGHED D A, LALLICH S, MUHLENBACH F. Separa-bility index in supervised learning[C]// LNCS 2431: Procee-dings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, Helsinki, Aug 19-23, 2002. Berlin, Heidelberg: Springer, 2002: 475-487.
[8]	MUHLENBACH F, LALLICH S, ZIGHED D A. Iden-tifying and handling mislabelled instances[J]. Journal of Intelligent Information Systems, 2004, 22(1): 89-109. DOI URL
[9]	XIA S, PENG D, MENG D, et al. A fast adaptive k-means with no bounds[J]. IEEE Transactions on Pattern Analy-sis and Machine Intelligence, 2020. DOI: 10.1109/TPAMI. 2020.3008694. DOI
[10]	HASECKE F, HAHN L, KUMMERT A. Fast lidar clus-tering by density and connectivity[J]. arXiv:2003.00575, 2020.
[11]	GAN H, SANG N, HUANG R, et al. Using clustering ana-lysis to improve semi-supervised classification[J]. Neurocom-puting, 2013, 101: 290-298.
[12]	GAN H, TONG X, JIANG Q, et al. Discussion of FCM al-gorithm with partial supervision[C]// Proceedings of the 8th International Symposium on Distributed Computing and Applications to Business, Engineering and Science, Wuhan, 2009. Beijing: Publishing House of Electronics Industry, 2009: 27-31.
[13]	TONG X, JIANG Q, SANG N, et al. The feature weighted FCM algorithm with semi-supervised[C]// Proceedings of the 8th International Symposium on Distributed Computing and Applications to Business, Engineering and Science,Wuhan, 2009. Beijing: Publishing House of Electronics Industry, 2009: 22-26.
[14]	RODRIGUEZ A, LAIO A. Clustering by fast search and find of density peaks[J]. Science, 2014, 344(6191): 1492-1496. DOI URL
[15]	陈叶旺, 申莲莲, 钟才明, 等. 密度峰值聚类算法综述[J]. 计算机研究与发展, 2020, 57(2): 378-394.
	CHEN Y W, SHEN L L, ZHONG C M, et al. Survey on density peak clustering algorithm[J]. Journal of Computer Research and Development, 2020, 57(2): 378-394.
[16]	丁志成, 葛洪伟. 优化分配策略的密度峰值聚类算法[J]. 计算机科学与探索, 2020, 14(5): 792-802. DOI
	DING Z C, GE H W. Density peaks clustering with opti-mized allocation strategy[J]. Journal of Frontiers of Com-puter Science and Technology, 2020, 14(5): 792-802.
[17]	丁世飞, 徐晓, 王艳茹. 基于不相似性度量优化的密度峰值聚类算法[J]. 软件学报, 2020, 31(11): 3321-3333.
	DING S F, XU X, WANG Y R. Optimized density peaks clustering algorithm based on dissimilarity measure[J]. Jour-nal of Software, 2020, 31(11): 3321-3333.
[18]	柏锷湘, 罗可, 罗潇. 结合自然和共享最近邻的密度峰值聚类算法[J]. 计算机科学与探索, 2021, 15(5): 931-940. DOI
	BAI E X, LUO K, LUO X. Peak density clustering algori-thm combining natural and shared nearest neighbor[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(5): 931-940.
[19]	刘娟, 万静. 自然反向最近邻优化的密度峰值聚类算法[J]. 计算机科学与探索, 2021, 15(10): 1888-1899. DOI
	LIU J, WAN J. Optimized density peak clustering algorithm by natural reverse nearest neighbor[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(10): 1888-1899.
[20]	钱雪忠, 金辉. 自适应聚合策略优化的密度峰值聚类算法[J]. 计算机科学与探索, 2020, 14(4): 712-720. DOI
	QIAN X Z, JIN H. Optimized density peak clustering algo-rithm by adaptive aggregation strategy[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(4): 712-720.
[21]	刘沧生, 许青林. 基于密度峰值优化的模糊C均值聚类算法[J]. 计算机工程与应用, 2018, 54(14): 153-157. DOI
	LIU C S, XU Q L. Fuzzy C-means clustering algorithm based on density peak value optimization[J]. Computer Enginee-ring and Applications, 2018, 54(14): 153-157.
[22]	WU D, SHANG M, LUO X, et al. Self-training semi-super-vised classification based on density peaks of data[J]. Neu-rocomputing, 2018, 275: 180-191.
[23]	艾震鹏, 王振友. 基于数据密度的半监督自训练分类算法[J]. 计算机应用研究, 2019, 36(4): 1072-1074.
	AI Z P, WANG Z Y. Self-training semi-supervised classifi-cation based on density of data[J]. Application Research of Computers, 2019, 36(4): 1072-1074.
[24]	卫丹妮, 杨有龙, 仇海全. 结合密度峰值和切边权值的自训练算法[J]. 计算机工程与应用, 2021, 57(2): 70-76. DOI
	WEI D N, YANG Y L, QIU H Q. Self-training algorithm combining density peak and cut edge weight[J]. Computer Engineering and Applications, 2021, 57(2): 70-76.

Semi-supervised Self-Training Algorithm for Density Peak Membership Optimization

密度峰值隶属度优化的半监督Self-Training算法

RichHTML

PDF

Knowledge

Abstract

Cite this article

share this article

Figures/Tables 11

References 24

Related Articles 14

Recommended Articles

Metrics

[1]	CHEN Yang, WANG Shitong. Ensemble Method of Diverse Regularized Extreme Learning Machines [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(8): 1819-1928.
[2]	XIE Juanying, ZHANG Kaiyun. XR-MSF-Unet: Automatic Segmentation Model for COVID-19 Lung CT Images [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(8): 1850-1864.
[3]	YANG Zheng, DENG Zhaohong, LUO Xiaoqing, GU Xin, WANG Shitong. Target Tracking System Constructed by ELM-AE and Transfer Representation Learning [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(7): 1633-1648.
[4]	ZHANG Zhuang, WANG Shitong. Ensemble Model of Takagi-Sugeno-Kang Fuzzy Classifiers for Imbalanced Data [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(6): 1374-1382.
[5]	XIE Xintong, HU Yueyang, LIU Xuanzhe, ZHAO Yaoshuai, JIANG Hai’ou. Rumor Detection Based on Representative User Characteristics Learning Through Propagation [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(6): 1334-1342.
[6]	DONG Wenbo, SUN Shiliang, YIN Minzhi. Research and Development of Medical Knowledge Graph Reasoning [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(6): 1193-1213.
[7]	SHEN Ruicai, ZHAI Junhai, HOU Yingzhen. Multi-discriminator Generative Adversarial Networks Based on Selective Ensemble Learning [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(6): 1429-1438.
[8]	LIN Jiawei, WANG Shitong. Deep Adversarial-Reconstruction-Classification Networks for Unsupervised Domain Adaptation [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(5): 1107-1116.
[9]	SUN Wu, DENG Zhaohong, LOU Qiongdan, GU Xin, WANG Shitong. Unsupervised Heterogeneous Domain Adaptation with Fuzzy Rule Learning [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(2): 403-412.
[10]	ZHANG Fengyexin, WANG Jun, JIA Xiuyi, PAN Xiang, DENG Zhaohong, SHI Jun, WANG Shitong. Label Distribution Learning for Computer Aided Diagnosis of Multi-class ASD Classification [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(1): 194-204.
[11]	FU Bin⁺, WANG Zhihai, WANG Zhongfeng. Algorithm of Classifier Selection for Maximizing the Margin [J]. Journal of Frontiers of Computer Science and Technology, 2011, 5(1): 59-67.
[12]	PAN Shirui¹, ZHANG Yang^1,2+, LI Xue³, WANG Yong⁴. Nearest Neighbor Algorithm for Positive and Unlabeled Learning with Uncertainty [J]. Journal of Frontiers of Computer Science and Technology, 2010, 4(9): 769-779.
[13]	YUAN Lei¹, ZHANG Yang²⁺, LI Mei¹, LI Xue³, WANG Yong⁴. Programming the VFDT Algorithm in Data Stream Manage-ment System* [J]. Journal of Frontiers of Computer Science and Technology, 2010, 4(8): 673-682.
[14]	SHI Guoqiang, NIU Changyong, FAN Ming⁺. Constructing Ensembles of Rule-based Classifiers Using PCA* [J]. Journal of Frontiers of Computer Science and Technology, 2010, 4(5): 455-463.