Research on Entropy-Based Term Weighting Methods in Text Categorization

Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2016, Vol. 10, Issue 9: 1299-1309. DOI: 10.3778/j.issn.1673-9418.1509018

CHEN Kewen+, ZHANG Zuping, LONG Jun

School of Information Science and Engineering, Central South University, Changsha 410083, China

Online: 2016-09-01 Published: 2016-09-05

Abstract

Abstract: As the volume of text data grows rapidly, automatic text categorization (TC) is becoming increasingly important. Term weighting, i.e., computing the weights of the terms used as text features, is an active research topic in TC because it directly affects classification accuracy. Entropy-based weighting (EW) methods have been found to be usually more effective than other methods, but existing EW methods still have problems; for example, on some corpora they may perform worse than the traditional TF-IDF (term frequency & inverse document frequency). This paper therefore proposes a new term weighting scheme, LTF-ECDP (logarithmic term frequency & entropy-based class distinguishing power), which combines logarithmic term frequency with a new entropy-based factor that measures class distinguishing power. To evaluate LTF-ECDP, eight term weighting methods were compared in a series of TC experiments using a support vector machine (SVM) on three benchmark datasets: the Chinese corpus TanCorp and the English corpora WebKB and 20 Newsgroups. The experimental results show that LTF-ECDP outperforms the other five entropy-based weighting methods as well as the well-known TF-IDF and TF-RF (term frequency & relevance frequency) methods. It not only improves TC accuracy but also performs more stably across different datasets.

Key words: term weighting, entropy-based weighting, text categorization, class distinguishing power
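The abstract does not give the LTF-ECDP formula, so the following is only a minimal sketch of the general idea it describes: a logarithmic term frequency multiplied by an entropy-based factor that is high when a term is concentrated in few classes. The function names (`log_tf`, `entropy_cdp`, `term_weight`), the use of per-class document rates, and the normalization by the logarithm of the number of classes are all assumptions made for illustration, not the authors' definition.

```python
import math


def log_tf(tf):
    """Logarithmic term frequency: dampens large raw counts."""
    return math.log(1 + tf) if tf > 0 else 0.0


def entropy_cdp(class_doc_freq, class_sizes):
    """Illustrative entropy-based class distinguishing power in [0, 1].

    class_doc_freq[c]: number of documents in class c containing the term.
    class_sizes[c]:    total number of documents in class c.
    NOTE: this normalized-entropy form is an assumption, not the paper's formula.
    """
    # Per-class rate of documents containing the term (corrects for class size).
    rates = [class_doc_freq.get(c, 0) / n for c, n in class_sizes.items() if n > 0]
    total = sum(rates)
    if total == 0 or len(rates) < 2:
        return 0.0
    # Distribution of the term over classes, then its Shannon entropy.
    probs = [r / total for r in rates if r > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Low entropy (term concentrated in few classes) gives a factor near 1.
    return 1.0 - entropy / math.log(len(rates))


def term_weight(tf, class_doc_freq, class_sizes):
    """Hypothetical combined weight: log term frequency times the entropy factor."""
    return log_tf(tf) * entropy_cdp(class_doc_freq, class_sizes)


sizes = {"sports": 10, "finance": 10, "tech": 10}
focused = term_weight(3, {"sports": 5}, sizes)
spread = term_weight(3, {"sports": 5, "finance": 5, "tech": 5}, sizes)
print(focused > spread)
```

Under these assumptions, a term that occurs in only one class receives the full logarithmic term frequency, while a term spread evenly over all classes receives a weight of zero. A practical scheme might add a small constant to the factor so that such terms are not discarded entirely.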

CHEN Kewen, ZHANG Zuping, LONG Jun. Research on Entropy-Based Term Weighting Methods in Text Categorization[J]. Journal of Frontiers of Computer Science and Technology, 2016, 10(9): 1299-1309.

