Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2013, Vol. 7 ›› Issue (11): 1009-1017. DOI: 10.3778/j.issn.1673-9418.1306012

• Academic Research •

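Data Dustpan (数据簸箕)

QIAN Yuhua (钱宇华)1+, CHENG Honghong (成红红)2, ZHANG Xiaoqin (张晓琴)2, LIANG Jiye (梁吉业)1

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  2. School of Mathematics, Shanxi University, Taiyuan 030006, China
  • Publication date: 2013-11-01 Release date: 2013-11-04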

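Abstract (摘要): The arrival of the big data era poses major challenges for data mining and knowledge discovery. The dustpan (簸箕), a familiar farm tool, can quickly separate different objects. Based on the working mechanism of the dustpan, this paper proposes a novel learning principle, the random parallel ranking principle (RPRP), called the data dustpan, which can rank and classify data efficiently. To verify the effectiveness and efficiency of this learning principle, a new clustering method, the clustering dustpan, is designed. Experimental results show that the clustering dustpan clusters data quickly and effectively. In addition, the learning principle can also be used to design efficient classifiers. The data dustpan is expected to advance the theory and methods of machine learning and knowledge discovery in the big data setting.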

Key words (关键词): very-large-scale data set, data dustpan, random parallel ranking principle (RPRP), clustering dustpan

Abstract: Very-large-scale data pose a great challenge for data mining and knowledge discovery. A dustpan, a familiar tool, can rapidly separate objects into clusters. Based on the working mechanism of a dustpan, this paper presents a novel learning principle, the data dustpan, which is built on a random parallel ranking principle (RPRP) and can efficiently rank objects from a large-scale data set. Using the data dustpan, this paper then develops a fast clustering method called the clustering dustpan. The experimental results show that the clustering dustpan algorithm organizes data very efficiently. Notably, the data dustpan can also be used to efficiently learn a classifier from a large-scale data set.

Key words: very-large-scale data set, data dustpan, random parallel ranking principle (RPRP), clustering dustpan