Clustering Algorithm for High-Dimensional Data Under New Dimensionality Reduction Criteria

doi:10.3778/j.issn.1673-9418.1902023

Journal of Frontiers of Computer Science and Technology, 2020, Vol. 14, Issue (1): 96-107.

WAN Jing, WU Fan, HE Yunbin, LI Song

School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China

Online: 2020-01-01 Published: 2020-01-09

Abstract

Clustering accuracy can degrade when high-dimensional data is reduced with principal component analysis (PCA) before clustering, and PCA alone cannot correct this. To address the problem, a new concept of attribute space is proposed. Combining attribute space with information entropy yields a dimensionality reduction criterion based on feature similarity, and from it a new dimensionality reduction algorithm, entropy-PCA (EN-PCA). Because each reduced feature is a linear combination of the original features, interpretability suffers and the input is inflexible; to address this, a sparse principal component algorithm based on ridge regression (ESPCA) is proposed. ESPCA takes the PCA dimensionality reduction result as its input and obtains sparse results without iteration, which improves both flexibility and solution speed. Finally, working on the reduced data, the initialization, selection, crossover, and mutation operations of the genetic algorithm are improved to address problems such as the slow convergence of genetic-algorithm clustering, giving a new clustering algorithm, genetic K-means algorithm ++ (GKA++). Experimental analysis shows that EN-PCA performs stably and that GKA++ performs well in both clustering effectiveness and efficiency.

Key words: clustering, principal component analysis (PCA), feature similarity, ridge regression, genetic algorithm
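
The abstract does not define the attribute space or the feature-similarity criterion, so the following is only a minimal sketch of the information-entropy ingredient, not EN-PCA itself. It assumes equal-width binning and Shannon entropy per attribute; the function names `attribute_entropy` and `rank_attributes`, the `bins` parameter, and the toy data are all illustrative assumptions.

```python
import math
from collections import Counter


def attribute_entropy(values, bins=4):
    """Shannon entropy (in bits) of one attribute after equal-width binning.

    Assumption: equal-width binning is used here only for illustration;
    the paper's own discretization, if any, is not stated in the abstract.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant attribute carries no information
    width = (hi - lo) / bins
    # Map each value to a bin index; the maximum value goes into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def rank_attributes(rows, bins=4):
    """Return (column index, entropy) pairs, highest entropy first."""
    columns = list(zip(*rows))
    scores = [(j, attribute_entropy(col, bins)) for j, col in enumerate(columns)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: column 0 is constant, column 1 is spread evenly.
rows = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
ranking = rank_attributes(rows)
```

Here the constant column scores zero and the evenly spread column scores highest. How such per-attribute scores would feed into a similarity-based reduction criterion is left open, since the abstract does not specify it.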

WAN Jing, WU Fan, HE Yunbin, LI Song. Clustering Algorithm for High-Dimensional Data Under New Dimensionality Reduction Criteria[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(1): 96-107.

