Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (7): 1289-1301. DOI: 10.3778/j.issn.1673-9418.2004074

• Artificial Intelligence •


KNN Algorithm of Enhanced Clustering Based on Density Canopy and Deep Feature

SHEN Xueli, QIN Xinyu   

1. College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
    2. Quanzhou Equipment Manufacturing Research Institute, Haixi Research Institute, Chinese Academy of Sciences, Quanzhou, Fujian 362216, China
• Online: 2021-07-01  Published: 2021-07-09


Abstract:

As the most widely used supervised classification algorithm, the K-nearest neighbor (KNN) algorithm is often inefficient when processing large-scale, multidimensional data. Therefore, an improved KNN algorithm suited to high-dimensional, large-volume data is proposed. Firstly, a deep neural network (DNN) is used as a feature extractor and for dimensionality reduction, so as to learn the most appropriate deep feature representation. Then, an appropriate number of clusters and the initial cluster centers are obtained through the density Canopy algorithm and serve as the input parameters of the subsequent K-means clustering. Finally, the learned data are clustered, the Hashing strategy from approximate similarity search (ASS) is used to partition the clusters according to approximate similarity, and the result is taken as the new training sample set of the KNN classifier. In addition, considering that the nearest neighbor samples to be queried may fall in different clusters, which degrades the performance of the KNN search, an additional clustering enhancement strategy is adopted during clustering, which effectively alleviates this situation. Comparison tests on five different datasets show that, compared with the baseline algorithms, the proposed algorithm not only greatly improves KNN classification accuracy, but also effectively improves classification efficiency, reduces the number of distance computations required for searching, and shows good robustness to noisy data.

Key words: K nearest neighbor (KNN), density Canopy, enhanced clustering, deep neural networks (DNN), approximate similarity search (ASS)
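The core of the pipeline described in the abstract (density Canopy to choose the number of clusters and initial centers, K-means to cluster, then KNN restricted to the query's nearest cluster) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the DNN feature extraction, hashing-based partitioning, and clustering enhancement steps are omitted, and the canopy radius rule (a fraction `t` of the mean pairwise distance) is an assumed simplification of the density Canopy criterion.

```python
import numpy as np

def density_canopy(X, t=1.0):
    """Simplified density Canopy: the radius is a fraction t of the mean
    pairwise distance; repeatedly take the densest remaining point as a
    center and remove every point inside its canopy. Returns (k, centers)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    r = t * D[np.triu_indices(len(X), 1)].mean()
    remaining = np.arange(len(X))
    centers = []
    while remaining.size:
        sub = D[np.ix_(remaining, remaining)]
        density = (sub < r).sum(axis=1)              # neighbors within radius r
        c = remaining[density.argmax()]              # densest point -> new center
        centers.append(X[c])
        remaining = remaining[D[c, remaining] >= r]  # drop the whole canopy
    return len(centers), np.array(centers)

def kmeans(X, centers, iters=20):
    """Plain Lloyd iterations seeded with the Canopy centers."""
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None, :], axis=-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(len(centers))])
    return labels, centers

def knn_in_cluster(q, X, y, labels, centers, k=3):
    """Classify q by majority vote among its k nearest neighbors,
    searching only the cluster whose center is closest to q."""
    c = np.linalg.norm(centers - q, axis=1).argmin()
    idx = np.where(labels == c)[0]
    near = idx[np.argsort(np.linalg.norm(X[idx] - q, axis=1))[:k]]
    vals, counts = np.unique(y[near], return_counts=True)
    return vals[counts.argmax()]

# Toy demo: two well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
k, init = density_canopy(X)            # Canopy suggests k and initial centers
labels, centers = kmeans(X, init)      # K-means refines those centers
pred = knn_in_cluster(np.array([4.8, 5.1]), X, y, labels, centers)
```

Restricting the KNN search to one cluster is what cuts the number of distance computations: each query compares against roughly n/k training points instead of all n, at the cost of possible errors when true neighbors fall in an adjacent cluster (the situation the paper's clustering enhancement strategy targets).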