Journal of Frontiers of Computer Science and Technology, 2012, Vol. 6, Issue (10): 912-918. DOI: 10.3778/j.issn.1673-9418.2012.10.006

• Academic Research •

A Parallelized Semi-Supervised Naïve Bayes Classifier

JIANG Kai+, GAO Yang

  1. State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210093, China
  • Online: 2012-10-01 Published: 2012-09-28

Abstract: Text classification currently faces two problems: the volume of text to be classified is massive, with terabytes or even petabytes of new data produced every day, while labeled instances for training are scarce. To address both, this paper combines a semi-supervised Naïve Bayes algorithm with the Map-Reduce programming model and proposes a parallelized semi-supervised Naïve Bayes (PSNB) algorithm. Experimental results show that PSNB processes massive text data efficiently and makes effective use of unlabeled text to improve classifier accuracy.

Key words: Naïve Bayes, parallelization, semi-supervised, text classification, massive data
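
The abstract does not spell out how PSNB works, so the following is only a minimal sketch of one common way to combine the two ingredients it names, not the authors' algorithm. It assumes an EM-style semi-supervised multinomial Naïve Bayes in which the counting step is written as a map function and a reduce function, run here in a single process. All function names (`map_counts`, `reduce_counts`, `fit`, `posterior`, `train`, `predict`) and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import chain


def map_counts(doc):
    """Map: emit weighted (key, value) pairs for one document.

    doc is (tokens, weights), where weights maps class -> P(class | doc).
    A labeled document carries weight 1.0 on its true class.
    """
    tokens, weights = doc
    for c, w in weights.items():
        yield ("doc", c), w              # contributes to the class prior
        for t in tokens:
            yield ("word", c, t), w      # contributes to P(word | class)


def reduce_counts(pairs):
    """Reduce: sum the values emitted for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return totals


def fit(totals, classes, vocab):
    """M-step: turn summed counts into Laplace-smoothed log-parameters."""
    n_docs = sum(totals[("doc", c)] for c in classes)
    log_prior = {c: math.log((totals[("doc", c)] + 1.0) / (n_docs + len(classes)))
                 for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(totals[("word", c, t)] for t in vocab) + len(vocab)
        log_like[c] = {t: math.log((totals[("word", c, t)] + 1.0) / denom)
                       for t in vocab}
    return log_prior, log_like


def posterior(tokens, log_prior, log_like):
    """E-step for one document: P(class | tokens) under the current model."""
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
              for c in log_prior}
    top = max(scores.values())           # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def train(labeled, unlabeled, classes, iterations=5):
    """Fit on labeled data, then alternate E-steps and map/reduce M-steps."""
    vocab = {t for tokens, _ in labeled for t in tokens}
    vocab |= {t for tokens in unlabeled for t in tokens}
    hard = [(tokens, {label: 1.0}) for tokens, label in labeled]
    model = fit(reduce_counts(chain.from_iterable(map(map_counts, hard))), classes, vocab)
    for _ in range(iterations):
        # E-step: soft-label every unlabeled document with the current model.
        soft = [(tokens, posterior(tokens, *model)) for tokens in unlabeled]
        # M-step: recount over labeled and soft-labeled documents.
        pairs = chain.from_iterable(map(map_counts, hard + soft))
        model = fit(reduce_counts(pairs), classes, vocab)
    return model


def predict(tokens, model):
    """Return the most probable class for a token list."""
    post = posterior(tokens, *model)
    return max(post, key=post.get)


# Toy data (hypothetical): "coach" and "loan" occur only in unlabeled documents.
labeled = [(["goal", "match", "team"], "sport"),
           (["stock", "market", "bank"], "finance")]
unlabeled = [["team", "coach", "match"], ["bank", "loan", "market"],
             ["coach", "goal"], ["loan", "stock"]]
model = train(labeled, unlabeled, classes=["sport", "finance"])
```

In this sketch each `map_counts` call depends on a single document, and `reduce_counts` only sums values per key, which is the property that would let the counting step be distributed across workers in a Map-Reduce framework. On the toy data, any class association for "coach" or "loan" can only come from the unlabeled documents, which illustrates how unlabeled text can influence the classifier.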