Journal of Frontiers of Computer Science and Technology ›› 2016, Vol. 10 ›› Issue (9): 1299-1309. DOI: 10.3778/j.issn.1673-9418.1509018

Research on Entropy-Based Term Weighting Methods in Text Categorization

CHEN Kewen+, ZHANG Zuping, LONG Jun   

  1. School of Information Science and Engineering, Central South University, Changsha 410083, China
  • Online: 2016-09-01 Published: 2016-09-05

文本分类中基于熵的词权重计算方法研究

陈科文+,张祖平,龙  军   

  1. 中南大学 信息科学与工程学院,长沙 410083

Abstract: As the volume of textual data is already very large and still growing rapidly, automatic text categorization (TC) is becoming increasingly important. Term weighting, i.e., calculating the weights of the terms used as text features, is an active research topic in TC because it affects classification accuracy. Entropy-based weighting (EW) methods have been found to be usually more effective than others. However, the existing EW methods still have problems; for example, on some text corpora they may perform worse than the traditional TF-IDF (term frequency & inverse document frequency). This paper therefore proposes a new term weighting scheme called LTF-ECDP (logarithmic term frequency & entropy-based class distinguishing power), which combines logarithmic term frequency with a new entropy-based weighting factor that measures class distinguishing power. To test LTF-ECDP and compare it with other weighting methods, a considerable number of TC experiments using a support vector machine (SVM) were carried out on three popular benchmark datasets: a Chinese corpus, TanCorp, and two English corpora, WebKB and 20 Newsgroups. The experimental results show that LTF-ECDP outperforms the other five entropy-based weighting methods and two well-known methods, TF-IDF and TF-RF (term frequency & relevance frequency). Compared with the other term weighting methods, LTF-ECDP further improves TC accuracy while performing consistently well across the different datasets.

Key words: term weighting, entropy-based weighting, text categorization, class distinguishing power

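Abstract (translated from the Chinese): As the volume of text data has become very large and is still increasing rapidly, automatic text categorization is becoming more and more important. To improve classification accuracy, methods for calculating the weights of the terms used as text features are one of the research hotspots in text categorization. Studies have found that weighting methods based on information entropy (entropy weighting) are more effective than other methods, but the existing methods still have problems; for example, on some corpora they may perform worse than TF-IDF (term frequency & inverse document frequency). Therefore, logarithmic term frequency is combined with a new entropy-based factor that measures class distinguishing power, and the LTF-ECDP (logarithmic term frequency & entropy-based class distinguishing power) method is proposed. Through a series of text categorization experiments using a support vector machine (SVM) on the TanCorp, WebKB and 20 Newsgroups corpora, the performance of eight term weighting methods is verified and compared. The experimental results show that LTF-ECDP is superior to the other entropy weighting methods and to well-known methods such as TF-IDF and TF-RF (term frequency & relevance frequency): it not only improves text categorization accuracy but also performs more stably across different datasets.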

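Key words (translated from the Chinese): feature term weight, entropy weighting, text categorization, class distinguishing power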
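
The abstract names the two ingredients of LTF-ECDP (logarithmic term frequency and an entropy-based class distinguishing power factor) but does not state the formula. The Python sketch below only illustrates the general idea and should not be read as the paper's definition. It assumes, purely for illustration, that the logarithmic term frequency is log(1 + tf) and that the class distinguishing power is one minus the normalised entropy of a term's distribution over classes. The function name `ltf_ecdp_weights` and its input layout are hypothetical.

```python
import math
from collections import Counter


def ltf_ecdp_weights(docs, labels):
    """Illustrative entropy-based term weighting (assumed form, not the paper's).

    docs:   list of token lists, one per document
    labels: class label of each document
    Returns one {term: weight} dict per document.
    """
    classes = sorted(set(labels))
    n_classes = len(classes)
    class_sizes = Counter(labels)  # number of documents in each class

    # doc_freq[term][cls] = number of documents of class cls that contain term
    doc_freq = {}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq.setdefault(term, Counter())[label] += 1

    # ASSUMPTION: distinguishing power = 1 - H(term over classes) / log(#classes)
    power = {}
    max_entropy = math.log(n_classes) if n_classes > 1 else 1.0
    for term, per_class in doc_freq.items():
        # Per-class document rates, normalised into a distribution over classes.
        rates = [per_class[c] / class_sizes[c] for c in classes]
        total = sum(rates)
        probs = [r / total for r in rates if r > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        power[term] = 1.0 - entropy / max_entropy

    # ASSUMPTION: logarithmic term frequency = log(1 + raw count in the document)
    weighted = []
    for tokens in docs:
        counts = Counter(tokens)
        weighted.append(
            {t: math.log(1 + f) * power[t] for t, f in counts.items()}
        )
    return weighted


# Tiny made-up example: "the" occurs in every document of both classes,
# while "goal" occurs only in the "sports" class.
example_docs = [
    ["goal", "match", "the"],
    ["goal", "goal", "the"],
    ["stock", "market", "the"],
    ["stock", "the", "the"],
]
example_labels = ["sports", "sports", "finance", "finance"]
example_weights = ltf_ecdp_weights(example_docs, example_labels)
```

Under these assumptions, a term spread evenly over all classes (such as "the" in the toy data) receives a weight close to zero, while a term confined to a single class keeps its full logarithmic term frequency. The resulting per-document weight dictionaries could then be turned into feature vectors for a classifier such as an SVM, which is the setting the abstract describes.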