Journal of Frontiers of Computer Science and Technology ›› 2013, Vol. 7 ›› Issue (7): 639-648. DOI: 10.3778/j.issn.1673-9418.1305006


Rule-Based Classifier for Probabilistic Data

ZHAO Tingting1,2, ZHAO Suyun1+, PEI Bin1,2,3, CHEN Hong1,2, LI Cuiping1,2   

  1. Key Laboratory of Data Engineering and Knowledge Engineering, Ministry of Education, Renmin University of China, Beijing 100872, China
    2. School of Information, Renmin University of China, Beijing 100872, China
    3. Computer Science Research Section, Army Officer Academy of PLA, Hefei 230031, China
  • Online: 2013-07-01 Published: 2013-07-02


Abstract: Classification is an important and widely studied problem in data mining, but previous work has focused mainly on classification over certain (deterministic) data. Because probabilistic data arise in many fields, for example in sensor readings, classification methods for probabilistic data are needed. First, this paper proposes a new probabilistic data model that considers not only the randomness of the data but also the similarity of different intervals. Second, to classify such probabilistic data, it defines a discernible distance that measures the distance between probabilistic tuples. Finally, it proposes a basic rule-based classification algorithm and, building on it, introduces a variable-precision variant that reduces the classifier's sensitivity to noise or perturbation. Experimental results verify the effectiveness of the proposed algorithm.

Key words: classification, randomness, probabilistic data, discernible distance
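
The abstract does not give the definition of the discernible distance, so the following is only an illustrative sketch of how a distance between probabilistic values could combine randomness with interval similarity. Everything in it is an assumption made for illustration and not the paper's formulation: the representation of a probabilistic value as (interval, probability) pairs, the Jaccard-style `interval_similarity`, and the independence-based `expected_distance` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation (not from the paper): a probabilistic
# attribute value is a list of (interval, probability) pairs whose
# probabilities sum to 1.
Interval = Tuple[float, float]
ProbValue = List[Tuple[Interval, float]]


def interval_similarity(a: Interval, b: Interval) -> float:
    """Jaccard-style overlap of two intervals, in [0, 1].

    Assumes intervals of positive width; intervals that merely touch
    or do not overlap get similarity 0.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    union = max(a[1], b[1]) - min(a[0], b[0])
    return overlap / union


def expected_distance(x: ProbValue, y: ProbValue) -> float:
    """Expected interval dissimilarity, assuming x and y are independent."""
    return sum(
        p * q * (1.0 - interval_similarity(i, j))
        for i, p in x
        for j, q in y
    )


# A sensor reading spread over two intervals vs. a reading that is certain.
reading_a: ProbValue = [((0.0, 2.0), 0.5), ((2.0, 4.0), 0.5)]
reading_b: ProbValue = [((0.0, 2.0), 1.0)]
print(expected_distance(reading_a, reading_b))
```

Under these assumptions, two values that place all their probability on the same interval are at distance 0, and values supported on disjoint intervals are at distance 1; a rule-based classifier could then compare such distances against a threshold, which is the kind of role a variable-precision parameter might play.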
