Journal of Frontiers of Computer Science and Technology, 2011, Vol. 5, Issue 11: 1048-1056.

• Academic Research •

Advanced SemiBoost-CR Categorization Model Utilizing Confidence-Based Resampling

TANG Huanling (唐焕玲), LU Mingyu (鲁明羽)

  1. School of Computer Science and Technology, Shandong Institute of Business and Technology, Yantai, Shandong 264005, China
  2. School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning 116026, China
  • Received: 1900-01-01  Revised: 1900-01-01  Online: 2011-11-01  Published: 2011-11-01

Abstract: This paper proposes SemiBoost-CR, a categorization model built on confidence-based resampling that combines semi-supervised learning with ensemble learning. A formula is given for computing the confidence score of an unlabeled example from its labeled neighbors and its unlabeled neighbors. Resampling by confidence selects not only a proportion of the unlabeled examples with higher confidence scores but also a proportion of those with lower confidence scores, and the two groups are added to the labeled training set using different strategies. The high-confidence unlabeled examples are introduced to improve the accuracy of the base classifiers, while the low-confidence ones are introduced to further increase the diversity among the base classifiers. Comparative experiments show that SemiBoost-CR effectively improves the performance of a Naive Bayesian text classifier.

Key words: boosting, semi-supervised categorization, Naive Bayesian, confidence, resampling
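
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the confidence formula, the selection proportions, or the strategies used to add each group to the training set, so the sketch takes precomputed confidence scores as input and only shows the two-sided selection. The function name `resample_by_confidence` and the parameters `high_ratio` and `low_ratio` are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple


def resample_by_confidence(
    confidences: Sequence[float],
    high_ratio: float = 0.2,  # assumed proportion; not given in the abstract
    low_ratio: float = 0.1,   # assumed proportion; not given in the abstract
) -> Tuple[List[int], List[int]]:
    """Pick indices of unlabeled examples from both ends of the confidence ranking.

    Returns (high_idx, low_idx): the most confident examples (intended to help
    base-classifier accuracy) and the least confident ones (intended to help
    diversity). The two groups are kept disjoint.
    """
    n = len(confidences)
    # Rank example indices from highest to lowest confidence.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    n_high = min(n, int(round(n * high_ratio)))
    # Never let the low-confidence group overlap the high-confidence group.
    n_low = min(n - n_high, int(round(n * low_ratio)))
    high_idx = order[:n_high]
    low_idx = order[n - n_low:] if n_low > 0 else []
    return high_idx, low_idx
```

How each group is then merged into the labeled training set, and how the scores themselves are derived from labeled and unlabeled neighbors, is left out of the sketch because the abstract does not specify it.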