Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2020, Vol. 14 ›› Issue (7): 1142-1153. DOI: 10.3778/j.issn.1673-9418.1907013

• Database Technology •

Weighted Fuzzy Clustering Algorithm Based on Dynamic Interval

LUO Hao (罗浩), WANG Yanjie (王彦捷), NIU Minghang (牛明航), QIU Cunyue (邱存月), ZHANG Li (张利)

  1. College of Information, Liaoning University, Shenyang 110036, China
  • Online: 2020-07-01 Published: 2020-08-12

Abstract:

Clustering is widely used in data mining and data analysis, but incomplete data makes it considerably harder. To address the inaccuracy of filling missing attributes with point estimates when clustering incomplete data, a weighted fuzzy clustering algorithm based on dynamic intervals is proposed. First, a nearest-neighbor sample set is constructed for each missing attribute from the attribute correlation, and an estimation interval for the missing attribute is formed from it. To further reduce the interval filling error, an interval factor based on the dispersion of the nearest-neighbor sample set dynamically adjusts the interval size. Second, to fully exploit the implicit information in the attribute space and to reduce the influence of outliers on the cluster centers, the samples of the completed interval-valued dataset are weighted by local density. Finally, weighted fuzzy clustering of the interval-valued samples is carried out with these improvements. The proposed algorithm is evaluated on several UCI datasets and artificial datasets. The experimental results show that it effectively improves clustering accuracy, robustness, and stability of convergence.

Key words: incomplete data, interval filling, weighting, clustering algorithm
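
The three steps named in the abstract (interval filling from nearest neighbors, local-density sample weighting, weighted fuzzy clustering of interval data) can be illustrated with a minimal sketch. The abstract gives no formulas, so every concrete choice below is an assumption rather than the authors' method: neighbors are found with a partial Euclidean distance instead of the paper's attribute correlation, the interval half-width is taken as the standard deviation of the neighbor values, local density uses a Gaussian kernel, and the function names `fill_intervals`, `density_weights`, and `weighted_interval_fcm` are hypothetical.

```python
import numpy as np

def fill_intervals(X, k=3):
    """Turn each NaN into an interval [lower, upper]; observed values keep lower == upper.
    Assumes every attribute is observed in at least one row."""
    X = np.asarray(X, dtype=float)
    lower, upper = X.copy(), X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        cand = np.where(~np.isnan(X[:, j]))[0]          # rows where attribute j is observed
        dist = np.full(len(cand), np.inf)
        for t, r in enumerate(cand):
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])  # attributes observed in both rows
            if shared.any():
                dist[t] = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
        vals = X[cand[np.argsort(dist)[:k]], j]         # attribute j of the k nearest rows
        half = vals.std()                               # assumed: dispersion sets the width
        lower[i, j] = max(vals.min(), vals.mean() - half)
        upper[i, j] = min(vals.max(), vals.mean() + half)
    return lower, upper

def density_weights(lower, upper):
    """Weight each sample by a Gaussian-kernel local density of its interval midpoint."""
    mid = (lower + upper) / 2.0
    d = np.linalg.norm(mid[:, None, :] - mid[None, :, :], axis=2)
    dc = d[d > 0].mean() if np.any(d > 0) else 1.0      # assumed cutoff: mean pairwise distance
    rho = np.exp(-(d / dc) ** 2).sum(axis=1)
    return rho / rho.mean()                             # weights average to 1

def weighted_interval_fcm(lower, upper, w, c=2, m=2.0, iters=100, tol=1e-6, seed=0):
    """Weighted fuzzy c-means on intervals, using the squared distance
    ||x_lower - v_lower||^2 + ||x_upper - v_upper||^2."""
    Z = np.hstack([lower, upper])                       # stack bounds into one vector per sample
    rng = np.random.default_rng(seed)
    U = rng.random((len(Z), c))
    U /= U.sum(axis=1, keepdims=True)                   # memberships of each sample sum to 1
    for _ in range(iters):
        G = w[:, None] * U ** m
        V = (G.T @ Z) / G.sum(axis=0)[:, None]          # weighted cluster centers
        d2 = ((Z[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
        U_new = d2 ** (-1.0 / (m - 1.0))
        U_new /= U_new.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V

# Usage on a toy dataset with two missing entries.
X = np.array([[0.0, 0.1], [0.2, np.nan], [0.1, 0.0],
              [5.0, 5.1], [np.nan, 4.9], [5.2, 5.0]])
lower, upper = fill_intervals(X, k=2)
weights = density_weights(lower, upper)
U, V = weighted_interval_fcm(lower, upper, weights, c=2)
labels = U.argmax(axis=1)
```

In this sketch low-density samples (likely outliers) receive smaller weights and therefore pull the cluster centers less, which is the effect the abstract attributes to density-based weighting. The returned centers `V` hold the lower bounds in their first half and the upper bounds in their second half.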