计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2011, Vol. 5, Issue (4): 313-323.

• Academic Research •

A survey of semi-supervised text categorization

NIU Gang (牛罡), LUO Aibao (罗爱宝), SHANG Lin (商琳)

  1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China

  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-04-01 Published: 2011-04-01

Abstract: Text categorization is a problem that people encounter frequently in their daily work and an important research topic in machine learning. Semi-supervised learning algorithms, which use both labeled and unlabeled data, can significantly improve learning performance. This paper gives the definition and characteristics of text categorization and introduces traditional supervised classification algorithms and evaluation metrics. It then analyzes the characteristics and underlying theory of semi-supervised text categorization and describes several semi-supervised text categorization algorithms in detail, such as Bayesian methods and regularization methods.

Key words: text categorization, semi-supervised learning, naïve Bayes, manifold and spectral graph