Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2011, Vol. 5 ›› Issue (09): 826-834.

• Academic Research •

Dynamic Question Classification Ensemble Learning Algorithm Based on Class Label Clustering

TIAN Jinghua (田晶华), LI Cuiping (李翠平), CHEN Hong (陈红)

  1. School of Information, Renmin University of China, Beijing 100872, China
  2. Key Laboratory of Data Engineering and Knowledge Engineering, Ministry of Education, Renmin University of China, Beijing 100872, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-09-01 Published: 2011-09-01

Abstract: Question classification is a key technique in community question answering systems: it analyzes a natural language question posed by a user and returns a specific, appropriate category. Question classification in online communities faces three difficulties: a very large label set (>1 000 categories), a hierarchical label structure, and susceptibility to change over time. To address them, this paper proposes question classification algorithms for two different drifting granularities, and applies hierarchical ensemble learning over classifiers built from data collected at different moments, which improves classification accuracy and efficiency. In addition, to address the feature set confusion caused by too many class labels in a single base classifier, the existing hierarchical label tree is clustered based on the error rate and confusion matrix of the base classifiers, which further improves classification accuracy and efficiency.

Key words: question classification, concept drift, class label clustering
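
The abstract names confusion-matrix-based clustering of class labels but gives no procedure. The following is a minimal sketch of one way such a step could look, not the authors' algorithm. It assumes a square confusion matrix from a single base classifier and merges labels whose symmetrized confusion rate reaches a cutoff. The function name `cluster_labels_by_confusion`, the `threshold` parameter, and the union-find merging rule are all illustrative assumptions.

```python
from typing import Dict, List, Sequence


def cluster_labels_by_confusion(
    confusion: Sequence[Sequence[int]], threshold: float
) -> List[List[int]]:
    """Group class labels that a base classifier frequently confuses.

    confusion[i][j] counts samples of true class i predicted as class j.
    Two labels are merged when their symmetrized confusion rate is at
    least `threshold` (a hypothetical parameter, not taken from the paper).
    """
    n = len(confusion)
    parent = list(range(n))  # union-find forest over label indices

    def find(x: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    row_totals = [sum(row) for row in confusion]

    def rate(i: int, j: int) -> float:
        # Fraction of true-class-i samples that were predicted as class j.
        return confusion[i][j] / row_totals[i] if row_totals[i] else 0.0

    for i in range(n):
        for j in range(i + 1, n):
            # Average the two directions so the merge rule is symmetric.
            if (rate(i, j) + rate(j, i)) / 2 >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    clusters: Dict[int, List[int]] = {}
    for label in range(n):
        clusters.setdefault(find(label), []).append(label)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Toy matrix: labels 0 and 1 are often confused, 2 and 3 rarely.
    toy = [
        [8, 2, 0, 0],
        [3, 7, 0, 0],
        [0, 0, 9, 1],
        [0, 0, 0, 10],
    ]
    print(cluster_labels_by_confusion(toy, threshold=0.2))
```

In a hierarchical setup of the kind the abstract describes, each resulting cluster could act as a coarse class for a top-level classifier, with a smaller classifier separating the labels inside it. That arrangement is likewise an assumption here.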