Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2020, Vol. 14 ›› Issue (1): 96-107. DOI: 10.3778/j.issn.1673-9418.1902023

• Artificial Intelligence •

Clustering Algorithm for High-Dimensional Data Under New Dimensionality Reduction Criteria

WAN Jing (万静), WU Fan (吴凡), HE Yunbin (何云斌), LI Song (李松)

  1. School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
  • Online: 2020-01-01 Published: 2020-01-09

Abstract: Principal component analysis (PCA) cannot prevent the loss of clustering accuracy that occurs when high-dimensional data are clustered after dimensionality reduction. To address this, a new concept of attribute space is proposed. Combining attribute space with information entropy yields a dimensionality reduction criterion based on feature similarity, from which a new dimensionality reduction algorithm (entropy-PCA, EN-PCA) is derived. Because the reduced features are linear combinations of the original features, interpretability suffers and the input is inflexible; to address this, a sparse principal component algorithm based on ridge regression (ESPCA) is proposed. ESPCA takes the PCA dimensionality reduction result as its input and obtains sparse results without iteration, which improves flexibility and solution speed. Finally, working on the reduced data, the initialization, selection, crossover, and mutation operations of the genetic algorithm are improved to address problems such as the slow convergence of genetic-algorithm clustering, and a new clustering algorithm (genetic K-means algorithm ++, GKA++) is proposed. Experimental analysis shows that EN-PCA performs stably and that GKA++ performs well in both clustering effectiveness and efficiency.

Key words: clustering, principal component analysis (PCA), feature similarity, ridge regression, genetic algorithm
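
The abstract names information entropy as one ingredient of the EN-PCA criterion but does not define the criterion itself. As a rough illustration of that ingredient only, the following minimal sketch computes the Shannon entropy of each feature after equal-width binning. The function name `feature_entropy`, the binning scheme, and the toy data are assumptions made for this example; this is not the authors' EN-PCA criterion.

```python
import math
from collections import Counter


def feature_entropy(values, bins=4):
    """Shannon entropy (in bits) of one feature after equal-width binning.

    Hypothetical helper for illustration; the binning scheme is an assumption.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant feature carries no information
    width = (hi - lo) / bins
    # Map each value to a bin index; the maximum value falls into the last bin.
    labels = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


# Hypothetical toy data: rows are samples, columns are features.
data = [
    [0.0, 5.0, 1.0],
    [1.0, 5.0, 1.0],
    [2.0, 5.0, 1.0],
    [3.0, 5.0, 9.0],
]
columns = list(zip(*data))
entropies = [feature_entropy(col) for col in columns]
```

In this toy data the evenly spread first feature has the highest entropy, the constant second feature has zero entropy, and the third falls in between. An entropy-aware criterion could use such scores to weight or filter features before PCA, although how the paper actually combines entropy with attribute space is not stated in the abstract.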