Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2024, Vol. 18, Issue (1): 93-110. DOI: 10.3778/j.issn.1673-9418.2212075

• Theory·Algorithm •

Feature Selection Combining Artificial Bee Colony with K-means Clustering

SUN Lin (孙林), LIU Menghan (刘梦含), XUE Zhan’ao (薛占熬)

  1. College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, China
  2. School of Computer and Information Engineering, Henan Normal University, Xinxiang, Henan 453007, China
  • Online: 2024-01-01 Published: 2024-01-01

Abstract: K-means clustering is a simple, efficient, fast-converging, and easy-to-implement statistical analysis method. However, the traditional K-means algorithm is sensitive to the choice of initial cluster centers and easily falls into local optima, and most unsupervised feature selection algorithms tend to ignore the relationships between features. To address these issues, this paper proposes a feature selection method that combines artificial bee colony with K-means clustering. First, so that samples within the same cluster are highly similar while samples in different clusters are dissimilar, a new fitness function is constructed from the intra-cluster compactness and the inter-cluster dispersion to better reflect the characteristics of each sample, and a new expression for the probability of a honey source being selected is then constructed. Second, a weight that gradually decreases as the number of iterations increases is designed, and a honey source position update expression that dynamically narrows the search range of the bee colony is proposed. Third, to overcome the limitation that the traditional Euclidean distance considers only the cumulative difference between vectors, a weighted Euclidean distance expression is constructed that accounts for both the differing degrees of influence of the samples and the similarity between samples. Finally, the standard deviation and the distance correlation coefficient are introduced to define feature discrimination and feature representativeness, and their product is used to measure feature importance. Experimental results show that the proposed algorithm accelerates the convergence of the artificial bee colony algorithm, improves the clustering performance of the K-means algorithm, and effectively improves the classification performance of feature selection.

Key words: feature selection, artificial bee colony, K-means clustering, feature importance
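
The final step of the abstract, scoring each feature by the product of its discrimination and its representativeness, can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration rather than the authors' definition: discrimination is taken to be the per-feature standard deviation, representativeness is taken to be the mean distance correlation between a feature and every other feature, and the function names `distance_correlation` and `feature_importance` are hypothetical.

```python
import numpy as np


def _centered_distances(x):
    """Double-centered pairwise absolute-distance matrix of a 1-D sample."""
    d = np.abs(x[:, None] - x[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between two 1-D samples of equal length."""
    a = _centered_distances(np.asarray(x, dtype=float))
    b = _centered_distances(np.asarray(y, dtype=float))
    dcov2 = (a * b).mean()                    # squared distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()   # product of squared distance variances
    if dvar2 <= 0:                            # at least one sample is constant
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))


def feature_importance(X):
    """Score each column of X as discrimination * representativeness.

    Assumed definitions (not taken from the paper):
      discrimination     = standard deviation of the feature
      representativeness = mean distance correlation with all other features
    """
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    discrimination = X.std(axis=0)
    representativeness = np.zeros(n_features)
    for j in range(n_features):
        others = [
            distance_correlation(X[:, j], X[:, k])
            for k in range(n_features)
            if k != j
        ]
        representativeness[j] = np.mean(others) if others else 0.0
    return discrimination * representativeness


# Toy data: column 1 is a scaled copy of column 0, column 2 is constant.
X = np.array([[1, 2, 5], [2, 4, 5], [3, 6, 5], [4, 8, 5]])
scores = feature_importance(X)
ranking = np.argsort(scores)[::-1]  # feature indices, most important first
```

Under these assumptions a constant feature always receives a score of zero, and of two perfectly dependent features the one with the larger spread ranks higher. In practice the features would probably need to be put on a comparable scale first, since the raw standard deviation depends on the units of each feature.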