Parallel Density-Based Clustering Algorithm by Using Weighted Grid and Information Entropy

doi:10.3778/j.issn.1673-9418.1912034

Abstract:

Density-based clustering algorithms for big data suffer from unreasonable grid division of the data, low clustering accuracy, and low parallelization efficiency. To address these problems, this paper proposes DBWGIE-MR, a density-based clustering algorithm that uses weighted grids and information entropy on MapReduce. First, an adaptive division grid (ADG) strategy is proposed to divide the grid cells adaptively. Second, a neighboring expand (NE) strategy is designed to construct the weighted grid of each data partition; by strengthening the relevance between grids, it improves clustering accuracy. Meanwhile, a weighted grid and information entropy (WGIE) strategy is designed to calculate the density of each grid and to recompute the ε-neighborhood and core objects of density-based clustering so that they suit weighted grids. Then, combined with the MapReduce computing model, the COMCORE-MR (core clusters computing algorithm based on MapReduce) algorithm is proposed to compute local clusters in parallel. Finally, the MECORE-MR (merge core cluster by using MapReduce) algorithm, based on disjoint sets and MapReduce, is proposed to speed up the convergence of local-cluster merging, which improves the merging efficiency of density-based clustering. Experimental results show that DBWGIE-MR achieves better clustering results and better parallel performance on large-scale datasets.

Key words: big data, density-based clustering algorithm, weighted grid, information entropy
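
The abstract states that MECORE-MR merges local clusters with a disjoint-set (union-find) structure but gives no further detail. The following is a minimal single-machine sketch of that general idea, not the paper's MapReduce implementation. It assumes that two local clusters belong together whenever they share at least one point ID (for example, a point duplicated across partition boundaries); the names `DisjointSet` and `merge_local_clusters` and the cluster-ID format are hypothetical.

```python
from typing import Dict, Hashable, List, Set


class DisjointSet:
    """Union-find with path compression and union by rank."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}
        self.rank: Dict[Hashable, int] = {}

    def find(self, x: Hashable) -> Hashable:
        # Register unseen elements lazily as their own root.
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0
            return x
        # First pass: locate the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a: Hashable, b: Hashable) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1


def merge_local_clusters(local_clusters: Dict[str, Set[int]]) -> List[List[int]]:
    """Merge local clusters that share a point ID (assumed merge criterion)."""
    ds = DisjointSet()
    owner: Dict[int, str] = {}  # point ID -> first local cluster that contained it
    for cluster_id, points in local_clusters.items():
        ds.find(cluster_id)  # make sure isolated clusters are registered
        for p in points:
            if p in owner:
                ds.union(cluster_id, owner[p])
            else:
                owner[p] = cluster_id
    # Collect the points of every local cluster under its root.
    merged: Dict[Hashable, Set[int]] = {}
    for cluster_id, points in local_clusters.items():
        merged.setdefault(ds.find(cluster_id), set()).update(points)
    # Sort for a deterministic result.
    return sorted(sorted(points) for points in merged.values())


if __name__ == "__main__":
    local = {
        "p0_c0": {1, 2, 3},
        "p1_c0": {3, 4},
        "p1_c1": {7, 8},
        "p2_c0": {4, 5},
    }
    print(merge_local_clusters(local))
```

With path compression and union by rank, each `find` or `union` runs in near-constant amortized time, which is one reason a disjoint-set structure is a natural fit for repeatedly merging many small local clusters.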

HU Jian, XU Kaibin, MAO Yimin. Parallel Density-Based Clustering Algorithm by Using Weighted Grid and Information Entropy[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(12): 2094-2107.
