Feature Selection Combining Artificial Bee Colony with K-means Clustering

doi:10.3778/j.issn.1673-9418.2212075

Abstract

Abstract: K-means clustering is a simple, efficient statistical analysis method that converges quickly and is easy to implement. However, the traditional K-means algorithm is sensitive to the choice of initial cluster centers and easily falls into local optima, and most unsupervised feature selection algorithms tend to ignore the relationships between features. To address these issues, this paper proposes a feature selection algorithm that combines artificial bee colony with K-means clustering. Firstly, so that samples within the same cluster are highly similar while samples in different clusters are dissimilar, a new fitness function is constructed from the intra-cluster aggregation degree and the inter-cluster dispersion degree to better reflect the characteristics of each sample, and a new expression for the probability that a honey source is selected is then constructed. Secondly, a weight that decreases gradually as the number of iterations grows is designed, and a honey source position update expression that dynamically shrinks the search range of the bee colony is proposed. Thirdly, to overcome the limitation that the traditional Euclidean distance considers only the cumulative difference between vectors, a weighted Euclidean distance expression is constructed that accounts for both the differing influence of samples and the similarity between samples. Finally, the standard deviation and the distance correlation coefficient are introduced to define feature discrimination and feature representativeness, and their product is used to measure feature importance. Experimental results show that the proposed algorithm accelerates the convergence of the artificial bee colony algorithm, improves the clustering performance of the K-means algorithm, and effectively improves the classification performance of feature selection.

Keywords: feature selection, artificial bee colony, K-means clustering, feature importance
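
The final step of the abstract, scoring each feature by the product of its discrimination and its representativeness, can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulas, so two assumptions are made here and should not be read as the authors' definitions: discrimination is taken to be the per-feature standard deviation, and representativeness is taken to be the mean distance correlation between a feature and all other features. The function names `distance_correlation` and `feature_importance` are hypothetical.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays (value in [0, 1])."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute-difference matrices.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double centering: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom <= 0.0:
        return 0.0  # a constant input carries no dependence information
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def feature_importance(X):
    """Importance = discrimination * representativeness (both ASSUMED forms)."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    # ASSUMPTION: discrimination is the per-feature standard deviation.
    discrimination = X.std(axis=0)
    # ASSUMPTION: representativeness is the mean distance correlation
    # between a feature and every other feature.
    dcor = np.ones((n_features, n_features))  # diagonal is a placeholder
    for i in range(n_features):
        for j in range(i + 1, n_features):
            dcor[i, j] = dcor[j, i] = distance_correlation(X[:, i], X[:, j])
    representativeness = (dcor.sum(axis=1) - 1.0) / max(n_features - 1, 1)
    return discrimination * representativeness
```

Under these assumptions, a feature scores highly only if it both varies across samples and is strongly dependent on the remaining features; a constant feature scores zero. Features would normally be rescaled to a common range first, since the standard deviation is scale-dependent.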

SUN Lin, LIU Menghan, XUE Zhan’ao. Feature Selection Combining Artificial Bee Colony with K-means Clustering[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(1): 93-110.
