Journal of Frontiers of Computer Science and Technology ›› 2020, Vol. 14 ›› Issue (1): 96-107.DOI: 10.3778/j.issn.1673-9418.1902023
WAN Jing, WU Fan, HE Yunbin, LI Song
万静,吴凡,何云斌,李松
Abstract: To address the loss of clustering accuracy that follows dimensionality reduction of high-dimensional data with principal component analysis (PCA), a new concept of attribute space is proposed. By combining attribute space with information entropy, a dimensionality reduction criterion based on feature similarity is constructed, and a new dimensionality reduction algorithm, entropy-PCA (EN-PCA), is proposed. Because the features obtained after reduction are linear combinations of the original features, interpretability is poor and the input is inflexible; to address this, a sparse principal component algorithm based on ridge regression (ESPCA) is proposed. ESPCA takes the PCA dimensionality reduction result as its input and obtains sparse results without iteration, which improves both flexibility and solution speed. Finally, working on the reduced data, the initialization, selection, crossover, mutation and other operations of the genetic algorithm are improved to address the slow convergence of genetic-algorithm clustering, and a new clustering algorithm, genetic K-means algorithm ++ (GKA++), is proposed. Experimental analysis shows that EN-PCA is stable and that GKA++ performs well in both clustering effectiveness and efficiency.
Key words: clustering, principal component analysis (PCA), feature similarity, ridge regression, genetic algorithm
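The abstract does not give the details of ESPCA, so the following is only a minimal sketch of the general idea it names: starting from an ordinary PCA result and using ridge regression, in closed form and without iteration, to obtain loading vectors. It relies on the standard fact that ridge-regressing a component score on the centered data yields coefficients proportional to that component's loading vector. The function names, the penalty `lam`, and the `hard_threshold` step used to produce zeros are all assumptions made for illustration, not the authors' algorithm.

```python
import numpy as np

def pca_loadings(X, k):
    """Return the centered data and the top-k PCA loading vectors (as columns)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc, Vt[:k].T

def ridge_loadings(Xc, V, lam=1.0):
    """Regress each component score Xc @ v_j on Xc with an L2 penalty (closed form)."""
    p = Xc.shape[1]
    Z = Xc @ V  # component scores, shape (n, k)
    B = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ Z)
    return B / np.linalg.norm(B, axis=0)  # rescale columns to unit length

def hard_threshold(B, tol=0.1):
    """Hypothetical sparsification: zero out small loadings, then renormalize."""
    S = np.where(np.abs(B) >= tol, B, 0.0)
    norms = np.linalg.norm(S, axis=0)
    norms[norms == 0] = 1.0  # guard against an all-zero column
    return S / norms

# Synthetic data, used only to exercise the functions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))

Xc, V = pca_loadings(X, k=2)
B = ridge_loadings(Xc, V, lam=5.0)  # matches V after normalization
S = hard_threshold(B, tol=0.1)      # sparse, unit-length loadings
```

Because the ridge step has a closed-form solution, no iterative solver is needed; the thresholding step is the simplest possible stand-in for whatever sparsity mechanism the paper actually uses.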
WAN Jing, WU Fan, HE Yunbin, LI Song. Clustering Algorithm for High-Dimensional Data Under New Dimensionality Reduction Criteria[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(1): 96-107.
万静,吴凡,何云斌,李松. 新的降维标准下的高维数据聚类算法[J]. 计算机科学与探索, 2020, 14(1): 96-107.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.1902023
http://fcst.ceaj.org/EN/Y2020/V14/I1/96