Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2013, Vol. 7, Issue (10): 933-941. DOI: 10.3778/j.issn.1673-9418.1305013

• Academic Research •

Text Feature Selection Approach Based on Venture Decision

ZHAO Shichen (赵世琛)1, WANG Wenjian (王文剑)1,2+, GUO Husheng (郭虎升)1

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
  • Online: 2013-10-01 Published: 2013-09-30

Abstract: In Chinese text categorization, the choice of feature words strongly affects classification accuracy. To address this, this paper proposes a novel text feature selection approach based on dynamic venture decision (decision-making under risk). The approach constructs a utility function to evaluate the utility value of each feature word for the classification result, then applies the venture decision method to compute the expected loss of each feature word, and finally selects a subset of feature words with low expected loss in order to reduce dimensionality. The approach is applied to Chinese spam filtering and Web page classification. Experimental results on several benchmark datasets show that it selects the feature words that have greater influence on the classification results, and that the evaluation metrics of text classification, including accuracy, improve markedly.

Key words: text categorization, feature selection, venture decision
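
The pipeline summarized in the abstract (score each feature word with a utility, turn the utilities into an expected loss, keep the words with the lowest expected loss) can be illustrated with a minimal sketch. The abstract does not give the paper's utility function, so everything specific here is an assumption made for illustration only: the utility `|P(t|c) - P(t|not c)|`, the loss `1 - utility`, the use of class priors as state probabilities, and the function names `expected_losses` and `select_features`. Documents are assumed to be already segmented into word lists.

```python
from collections import Counter


def expected_losses(docs, labels):
    """Return {term: expected loss} for tokenized docs and parallel labels.

    Hypothetical scoring, not the paper's: each class c is treated as a
    state of nature with probability P(c); the utility of term t in state c
    is |P(t|c) - P(t|not c)|, and the loss is 1 - utility.
    """
    n = len(docs)
    class_counts = Counter(labels)
    # df[c][t] = number of documents of class c that contain term t
    df = {c: Counter() for c in class_counts}
    for tokens, c in zip(docs, labels):
        df[c].update(set(tokens))
    vocab = set().union(*(counts.keys() for counts in df.values()))

    losses = {}
    for t in vocab:
        total_df = sum(df[c][t] for c in class_counts)
        loss = 0.0
        for c, n_c in class_counts.items():
            n_rest = n - n_c
            p_in = df[c][t] / n_c
            p_out = (total_df - df[c][t]) / n_rest if n_rest else 0.0
            utility = abs(p_in - p_out)          # assumed utility in [0, 1]
            loss += (n_c / n) * (1.0 - utility)  # weight by class prior
        losses[t] = loss
    return losses


def select_features(docs, labels, k):
    """Keep the k terms with the smallest expected loss (ties broken by term)."""
    losses = expected_losses(docs, labels)
    return sorted(losses, key=lambda t: (losses[t], t))[:k]
```

Under these assumptions, a word that appears in every document of one class and in no document of the others gets an expected loss of 0 and is kept first, while a word spread evenly across classes gets an expected loss of 1 and is dropped first. Any other utility function could be substituted without changing the selection step.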