Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (7): 1289-1301. DOI: 10.3778/j.issn.1673-9418.2004074

• Artificial Intelligence •


KNN Algorithm of Enhanced Clustering Based on Density Canopy and Deep Feature

SHEN Xueli, QIN Xinyu   

1. College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
    2. Quanzhou Equipment Manufacturing Research Institute, Haixi Research Institute, Chinese Academy of Sciences, Quanzhou, Fujian 362216, China
• Online: 2021-07-01  Published: 2021-07-09


Abstract:

As the most widely used supervised classification algorithm, the K-nearest neighbor (KNN) algorithm is often inefficient when processing large-scale, multidimensional data. Therefore, an improved KNN algorithm suited to high-dimensional, large-volume data is proposed. Firstly, a deep neural network (DNN) is used as a feature extractor and for dimensionality reduction, so as to learn the most appropriate deep feature representation. Then, an appropriate number of clusters and the initial cluster centers are obtained through the density Canopy algorithm and serve as the input parameters of the subsequent K-means clustering. Finally, the learned data are clustered, the Hashing strategy from approximate similarity search (ASS) is used to partition the clusters according to approximate similarity, and the result is taken as the new training sample set of the KNN classifier. In addition, considering that the nearest neighbor samples to be queried may fall in different clusters, which degrades the performance of the KNN search, an additional clustering enhancement strategy is adopted during clustering, which effectively alleviates this situation. Comparison tests on five different datasets show that, compared with the baseline algorithms, the proposed algorithm not only greatly improves KNN classification accuracy, but also effectively improves classification efficiency, reduces the number of distance computations required for searching, and shows good robustness to noisy data.

Key words: K nearest neighbor (KNN), density Canopy, enhanced clustering, deep neural networks (DNN), approximate similarity search (ASS)
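The core of the pipeline described in the abstract (density Canopy to choose the number of clusters and initial centers, K-means to cluster, then KNN restricted to the query's nearest cluster) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the DNN feature extraction, hashing-based partitioning, and clustering enhancement steps are omitted, and the canopy radius rule (a fraction `t` of the mean pairwise distance) is an assumed simplification of the density Canopy criterion.

```python
import numpy as np

def density_canopy(X, t=1.0):
    """Simplified density Canopy: the radius is a fraction t of the mean
    pairwise distance; repeatedly take the densest remaining point as a
    center and remove every point inside its canopy. Returns (k, centers)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    r = t * D[np.triu_indices(len(X), 1)].mean()
    remaining = np.arange(len(X))
    centers = []
    while remaining.size:
        sub = D[np.ix_(remaining, remaining)]
        density = (sub < r).sum(axis=1)              # neighbors within radius r
        c = remaining[density.argmax()]              # densest point -> new center
        centers.append(X[c])
        remaining = remaining[D[c, remaining] >= r]  # drop the whole canopy
    return len(centers), np.array(centers)

def kmeans(X, centers, iters=20):
    """Plain Lloyd iterations seeded with the Canopy centers."""
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None, :], axis=-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(len(centers))])
    return labels, centers

def knn_in_cluster(q, X, y, labels, centers, k=3):
    """Classify q by majority vote among its k nearest neighbors,
    searching only the cluster whose center is closest to q."""
    c = np.linalg.norm(centers - q, axis=1).argmin()
    idx = np.where(labels == c)[0]
    near = idx[np.argsort(np.linalg.norm(X[idx] - q, axis=1))[:k]]
    vals, counts = np.unique(y[near], return_counts=True)
    return vals[counts.argmax()]

# Toy demo: two well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
k, init = density_canopy(X)            # Canopy suggests k and initial centers
labels, centers = kmeans(X, init)      # K-means refines those centers
pred = knn_in_cluster(np.array([4.8, 5.1]), X, y, labels, centers)
```

Restricting the KNN search to one cluster is what cuts the number of distance computations: each query compares against roughly n/k training points instead of all n, at the cost of possible errors when true neighbors fall in an adjacent cluster (the situation the paper's clustering enhancement strategy targets).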