Journal of Frontiers of Computer Science and Technology ›› 2010, Vol. 4 ›› Issue (5): 445-454. DOI: 10.3778/j.issn.1673-9418.2010.05.007

• Academic Research •

Ensemble Approach of Conditional Probability for Spam Text Stream Filtering*

LIU Wuying, WANG Ting+   

  1. College of Computer, National University of Defense Technology, Changsha 410073, China
  • Received:1900-01-01 Revised:1900-01-01 Online:2010-05-11 Published:2010-05-11
  • Contact: WANG Ting

Abstract: This paper studies online text classification for spam text stream filtering and proposes a novel ensemble approach based on conditional probability. Text is represented as token sequences, and classification knowledge is stored in an index structure; on this basis, an online training algorithm and an online classification algorithm for the model are designed and implemented. Multiple text features are extracted from email and short message service (SMS) documents, and spam filtering experiments are run separately on the TREC07P email corpus and a real Chinese SMS corpus. The experimental results show that the proposed approach achieves very good spam filtering performance.

Keywords: spam filtering, text stream, ensemble conditional probability (ECP), token sequence, index
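
The abstract names three ingredients: token-sequence text representation, an index that stores classification knowledge, and online training and classification. The Python sketch below shows one way these pieces could fit together. It is not the authors' ECP algorithm, whose details are not given here. The class name `OnlineTokenIndexFilter`, whitespace tokenization, the sequence length `n`, additive smoothing, and the mean-of-log-odds combination rule are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


class OnlineTokenIndexFilter:
    """Illustrative online spam filter (hypothetical; not the paper's ECP method)."""

    def __init__(self, n=2, smoothing=1.0):
        self.n = n                  # token-sequence length (assumed)
        self.smoothing = smoothing  # additive smoothing constant (assumed)
        # Index: token sequence -> [ham_count, spam_count]
        self.index = defaultdict(lambda: [0, 0])

    def _sequences(self, text):
        # Whitespace tokenization is an assumption; Chinese SMS text
        # would need a different tokenizer (e.g. character n-grams).
        tokens = text.lower().split()
        if not tokens:
            return set()
        if len(tokens) < self.n:
            return {tuple(tokens)}
        return {tuple(tokens[i:i + self.n])
                for i in range(len(tokens) - self.n + 1)}

    def train(self, text, is_spam):
        """Online update: one labelled message at a time."""
        for seq in self._sequences(text):
            self.index[seq][1 if is_spam else 0] += 1

    def score(self, text):
        """Combine per-sequence estimates of P(spam | sequence) into one score."""
        s = self.smoothing
        log_odds = []
        for seq in self._sequences(text):
            if seq not in self.index:   # skip unseen sequences without inserting
                continue
            ham, spam = self.index[seq]
            p = (spam + s) / (ham + spam + 2 * s)
            log_odds.append(math.log(p / (1 - p)))
        if not log_odds:
            return 0.5                  # no evidence either way
        mean = sum(log_odds) / len(log_odds)
        return 1 / (1 + math.exp(-mean))


f = OnlineTokenIndexFilter()
f.train("win free money now", is_spam=True)
f.train("meeting at noon tomorrow", is_spam=False)
spam_score = f.score("free money")
ham_score = f.score("meeting at noon")
```

In a text-stream setting, each message would be scored first and then passed to `train` once its true label becomes available, so the index grows incrementally and no batch retraining is needed.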
