Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2011, Vol. 5, Issue 8: 695-706.

• Academic Research •

Self-Study Algorithm for Filtering Chinese Text Content through Two Layers in Web Real-Time Environment

DUAN Lei (段磊), TANG Changjie (唐常杰), ZUO Jie (左劼), PENG Jing (彭京), LIU Tingting (刘婷婷), GOU Chi (苟驰)

  1. School of Computer Science, Sichuan University, Chengdu 610065, China
  2. Department of Science & Technology, Chengdu Municipal Public Security Bureau, Chengdu 610017, China
  • Online: 2011-08-01 Published: 2011-08-01

Abstract: The freedom with which users publish information on the Internet poses new challenges for Web content filtering. This paper presents SAFE (self-study algorithm of filtering Chinese text content), a self-learning algorithm that filters Chinese text content at two levels. SAFE processes text as a data stream and, based on the Apriori property, mines key characters and keywords without relying on a dictionary, then uses them to filter documents at two levels. Experiments on real-world Web documents validate SAFE: for the given theme, its recall is at least 93.75% and its precision reaches 100%, and its running time satisfies the real-time requirements of Web applications.

Key words: data mining, text content filtering, keyword mining
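
The abstract does not spell out SAFE's internals, so the following is only a rough sketch of the general idea it names: dictionary-free mining of key characters and keywords using the Apriori property (a string can be frequent only if its substrings are frequent), followed by a two-level check. The function names, the `min_support` and `min_char_hits` thresholds, and the sample texts are all hypothetical, and the sketch works on a small batch rather than a data stream.

```python
from collections import Counter


def mine_frequent_ngrams(docs, min_support, max_len=4):
    """Return {n: {ngram: support}}; support = number of docs containing the n-gram.

    Hypothetical helper, not the published SAFE procedure.
    """
    levels = {}
    # Level 1: frequent single characters ("key characters"); whitespace is ignored.
    counts = Counter()
    for doc in docs:
        counts.update({ch for ch in doc if not ch.isspace()})
    frequent = {c: s for c, s in counts.items() if s >= min_support}
    if not frequent:
        return levels
    levels[1] = frequent
    for n in range(2, max_len + 1):
        prev = levels[n - 1]
        counts = Counter()
        for doc in docs:
            seen = set()
            for i in range(len(doc) - n + 1):
                gram = doc[i:i + n]
                # Apriori pruning: both (n-1)-length substrings must be frequent.
                if gram[:-1] in prev and gram[1:] in prev:
                    seen.add(gram)
            counts.update(seen)  # each doc contributes at most 1 per n-gram
        frequent = {g: s for g, s in counts.items() if s >= min_support}
        if not frequent:
            break
        levels[n] = frequent
    return levels


def two_level_filter(text, key_chars, keywords, min_char_hits=2):
    """Level 1: cheap key-character test. Level 2: keyword match."""
    hits = sum(1 for ch in set(text) if ch in key_chars)
    if hits < min_char_hits:
        return False
    return any(kw in text for kw in keywords)


# Hypothetical theme documents (topic: football).
theme_docs = ["足球比赛直播", "今晚足球联赛", "足球新闻速递"]
levels = mine_frequent_ngrams(theme_docs, min_support=2)
key_chars = set(levels.get(1, {}))
keywords = {g for n, grams in levels.items() if n >= 2 for g in grams}

print(levels)
print(two_level_filter("周末一起看足球", key_chars, keywords))  # passes both levels
print(two_level_filter("球赛很满足", key_chars, keywords))      # passes level 1 only
```

In this toy run the character test is the cheap first gate and the keyword test decides the final outcome; a real system would need streaming counts and tuned thresholds, which the abstract does not describe.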