Journal of Frontiers of Computer Science and Technology ›› 2020, Vol. 14 ›› Issue (12): 2094-2107.DOI: 10.3778/j.issn.1673-9418.1912034

Previous Articles     Next Articles

Parallel Density-Based Clustering Algorithm by Using Weighted Grid and Information Entropy

HU Jian, XU Kaibin, MAO Yimin   

  1. 1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
    2. Department of Information Engineering, College of Applied Science, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
  • Online:2020-12-01 Published:2020-12-11

基于加权网格和信息熵的并行密度聚类算法

胡健徐锴滨毛伊敏   

  1. 1. 江西理工大学 信息工程学院,江西 赣州 341000
    2. 江西理工大学 应用科学学院 信息工程系,江西 赣州 341000

Abstract:

Aiming at the problems of unreasonable division of data gridding, low accuracy of clustering results and low efficiency of parallelization in big data clustering algorithm based on density, this paper proposes a density-based clustering algorithm by using weighted grid and information entropy based on MapReduce, named DBWGIE-MR. Firstly, an adaptive division grid (ADG) strategy is proposed to divide the cell of grid adaptively. Secondly, a weighted grid construction strategy, neighboring expand (NE) which can strengthen relevance between grids is designed to improve the accuracy of clustering. Meanwhile, based on weighted grid and information entropy (WGIE), a density calculation strategy is designed to calculate the density of grid. In addition, the ε-neighborhood and core object of density-based clustering algorithm are recalculated, which is suitable for weighted grid. Then, COMCORE-MR (core clusters computing algorithm based on MapReduce) algorithm is proposed to compute the local clusters of clustering algorithm in parallel. Finally, based on disjoint-set and MapReduce, MECORE-MR (merge core cluster by using MapReduce) algorithm is proposed to speed up the convergence speed of merging local clusters, which improves the local clusters merging efficiency of density-based clustering algorithm. The experimental results show that the DBWGIE-MR algorithm has better clustering results and performs better parallelization in large scale dataset.

Key words: big data, density-based clustering algorithm, weighted grid, information entropy

摘要:

针对大数据下基于密度的聚类算法中存在的数据网格划分不合理,聚类结果准确度不高以及并行化效率较低等问题,提出了基于MapReduce和加权网格信息熵的DBWGIE-MR算法。首先提出自适应网格划分策略(ADG)来划分网格单元;其次提出邻居网格扩展策略(NE)用于构建每个数据分区的加权网格,以此提高聚类效果;同时提出加权网格信息熵策略(WGIE)来计算网格密度以及密度聚类算法的[ε]邻域和核心对象,使密度聚类算法更适用于加权网格;接着结合MapReduce计算模型,提出并行计算局部簇算法(COMCORE-MR),从而加快获取局部簇;最后提出了基于并查集的并行合并局部簇算法(MECORE-MR),用于加快合并局部簇的收敛速度,提升了基于密度的聚类算法对局部簇合并的效率。实验结果表明,DBWGIE-MR算法的聚类效果更佳,且在较大规模的数据集下算法的并行化性能更好。

关键词: 大数据, 密度聚类, 加权网格, 信息熵