Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (9): 1113-1119. DOI: 10.3778/j.issn.1673-9418.1403064

• Artificial Intelligence and Pattern Recognition •

An Application in Medical Data Mining Based on Twice Ensemble Learning

WEI Xiushen+, MU Xin, YANG Yang

  1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Online: 2014-09-01 Published: 2014-09-03

Abstract: The CCDM 2014 Data Mining Competition, built on medical diagnosis data, posed two problems that arise widely in practice: a multi-label problem and a multi-class classification problem. Both tasks exhibit class imbalance and have relatively few training samples. To address these difficulties, this paper draws on the ideas of twice learning and ensemble learning and proposes a new learning framework, twice ensemble learning. The framework uses a first round of ensemble learning to obtain a number of high-confidence samples, adds them to the original training set, and then performs a second round of learning on the new training set to obtain classifiers with better generalization performance. The competition results show that, compared with commonly used ensemble learning, twice ensemble learning achieved very good results on both problems.

Key words: twice learning, ensemble learning, class imbalance learning, data mining
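
The two-round procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say where the high-confidence samples come from, which base learner is used, or how confidence is measured, so the sketch assumes a separate unlabeled pool (`X_pool`), a bagged nearest-centroid base learner, vote agreement as the confidence score, and a threshold of 0.9. All function names are hypothetical, and handling of multi-label data or class imbalance is left out.

```python
import random
from collections import Counter

def fit_centroids(X, y):
    """Base learner (an assumption): one mean vector per class."""
    sums, counts = {}, Counter(y)
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def predict_one(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)))

def fit_ensemble(X, y, n_members=15, seed=0):
    """Bagging: train each member on a bootstrap resample of (X, y)."""
    rng = random.Random(seed)
    n = len(X)
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(n) for _ in range(n)]
        members.append(fit_centroids([X[i] for i in idx], [y[i] for i in idx]))
    return members

def vote(members, x):
    """Majority label and the fraction of members that agree with it."""
    label, hits = Counter(predict_one(m, x) for m in members).most_common(1)[0]
    return label, hits / len(members)

def twice_ensemble(X, y, X_pool, threshold=0.9, seed=0):
    """Round 1, select confident pool samples, then round 2 on the enlarged set."""
    first = fit_ensemble(X, y, seed=seed)
    X_aug, y_aug = list(X), list(y)
    for x in X_pool:
        label, conf = vote(first, x)
        if conf >= threshold:  # keep only high-confidence samples (assumed rule)
            X_aug.append(x)
            y_aug.append(label)
    second = fit_ensemble(X_aug, y_aug, seed=seed + 1)
    return second, len(X_aug) - len(X)

# Toy usage with made-up data: two well-separated classes and a small pool.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
y = [0, 0, 0, 1, 1, 1]
X_pool = [[0.05, 0.05], [1.05, 0.95], [0.5, 0.5]]
model, n_added = twice_ensemble(X, y, X_pool)
print(vote(model, [0.0, 0.0])[0], vote(model, [1.0, 1.0])[0], n_added)
```

The threshold controls a trade-off: a higher value admits fewer but more reliable samples into the second round, while a lower value enlarges the training set at the risk of propagating wrong labels.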