Journal of Frontiers of Computer Science and Technology, 2016, Vol. 10, Issue (7): 1003-1009. DOI: 10.3778/j.issn.1673-9418.1507102


Semi-supervised Clustering Algorithm for Complex Distributed Data

ZHANG Junxi1+, WU Xiaojun2, JIANG Jianghong3   

  1. College of Vehicle Engineering, Xi’an Aeronautical University, Xi’an 710077, China
  2. College of Automation, Northwestern Polytechnical University, Xi’an 710047, China
  3. School of Computer Science, Shaanxi Normal University, Xi’an 710062, China
  • Online: 2016-07-01  Published: 2016-07-01




Abstract: Semi-supervised clustering is a machine learning method that uses prior information to improve the clustering process. This paper introduces the cellular automata (CA) distance transform algorithm into semi-supervised clustering. In the first phase, the CA distance transform divides the dataset into several clusters, yielding the number of clusters and constraint information that serve as prior information for the next phase. In the second phase, a semi-supervised K-means clustering algorithm further divides the first-phase results to produce the final clustering. On this basis, the paper proposes the CA-K-means two-phase clustering algorithm. The proposed algorithm is compared with the K-means algorithm, GA-K-means, and the pure CA clustering algorithm on three artificial datasets and three UCI datasets with different structures. The experimental results show that it achieves higher clustering accuracy and better clustering performance on data with complex distributions.
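The two-phase idea can be illustrated with a minimal sketch. The abstract does not give the CA update rule, the neighbourhood, the grid resolution, or the form of the constraint information, so everything below is an assumption made for illustration rather than the authors' implementation: 2-D points are rasterized onto a grid, a von Neumann-neighbourhood cellular automaton computes a city-block distance transform, cells within a small radius are grouped into connected regions (phase one), and the resulting labels seed a plain K-means refinement (phase two). The function names `ca_distance_transform`, `ca_label_regions`, `coarse_clusters`, and `seeded_kmeans`, and the parameters `grid`, `radius`, and `max_iter`, are hypothetical.

```python
import numpy as np


def _min_of_neighbours(values, fill):
    """Minimum over the four von Neumann neighbours of every cell."""
    p = np.pad(values, 1, constant_values=fill)
    return np.minimum.reduce([p[:-2, 1:-1], p[2:, 1:-1], p[1:-1, :-2], p[1:-1, 2:]])


def ca_distance_transform(occupied):
    """City-block distance to the nearest occupied cell, obtained by
    repeating a synchronous CA update until the grid stops changing."""
    dist = np.where(occupied, 0.0, np.inf)
    while True:
        new = np.minimum(dist, _min_of_neighbours(dist, np.inf) + 1.0)
        if np.array_equal(new, dist):
            return dist
        dist = new


def ca_label_regions(mask):
    """Label 4-connected regions of `mask` by letting every cell adopt the
    smallest label among itself and its neighbours (0 marks background)."""
    big = mask.size + 1  # larger than any real label
    labels = np.where(mask, np.arange(1, mask.size + 1).reshape(mask.shape), 0)
    while True:
        cur = np.where(mask, labels, big)
        new = np.where(mask, np.minimum(cur, _min_of_neighbours(cur, big)), 0)
        if np.array_equal(new, labels):
            return labels
        labels = new


def coarse_clusters(points, grid=32, radius=1):
    """Phase one (assumed form): grid the 2-D points, grow each occupied
    cell by `radius` via the distance transform, and label the regions."""
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    cells = np.minimum(((points - lo) / span * grid).astype(int), grid - 1)
    occupied = np.zeros((grid, grid), dtype=bool)
    occupied[cells[:, 0], cells[:, 1]] = True
    regions = ca_label_regions(ca_distance_transform(occupied) <= radius)
    # Relabel region ids to 0..k-1; k is the cluster count found in phase one.
    _, labels = np.unique(regions[cells[:, 0], cells[:, 1]], return_inverse=True)
    return labels


def seeded_kmeans(points, seed_labels, max_iter=100):
    """Phase two (assumed form): K-means whose initial centres are the
    means of the phase-one clusters. Labels must cover 0..k-1."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(seed_labels)
    k = int(labels.max()) + 1
    centres = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    for _ in range(max_iter):
        sq_dist = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centre if a cluster empties
                centres[j] = members.mean(axis=0)
    return labels, centres
```

In this sketch the only prior information passed between phases is the cluster count and the seed labels. A fuller treatment of constraint information (for example, must-link or cannot-link pairs) would need details that the abstract does not provide.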

Key words: cellular automata, semi-supervised clustering algorithm, K-means clustering algorithm, CA-K-means two-phase clustering algorithm, complex distribution

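Abstract (translated from the Chinese): Semi-supervised clustering is a machine learning method that uses prior information to refine the clustering process. The cellular automata (CA) distance transform algorithm is introduced into semi-supervised clustering: a planar distance transform divides the dataset into several sub-clusters, yielding the number of clusters and constraint information, which serve as prior information for the next phase. A semi-supervised K-means clustering algorithm then further divides the first-phase results to obtain the complete cluster centers and number of clusters, and on this basis the CA-K-means two-phase clustering algorithm is proposed. In comparative simulation experiments on three artificial datasets and three standard UCI datasets, the CA-K-means two-phase clustering algorithm is compared with the semi-supervised K-means clustering algorithm, the genetic K-means clustering algorithm, and the pure CA hierarchical clustering algorithm. The results show that the proposed algorithm achieves higher clustering accuracy and better clustering performance on data with complex distributions.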

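Key words (translated from the Chinese): cellular automata, semi-supervised clustering, K-means clustering algorithm, CA-K-means two-phase clustering, complex distribution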