Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (4): 892-901. DOI: 10.3778/j.issn.1673-9418.2107033

• Artificial Intelligence · Pattern Recognition •

Multi-label Classification Based on Resampling and Ensemble Learning

WANG Tianhao, ZHANG Pei, ZHANG Zhao, CHEN Xihai, WANG Jing, ZHANG Baili   

  1. School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
  2. State Grid Zaozhuang Power Supply Company, Zaozhuang, Shandong 277099, China
  3. State Key Laboratory of Smart Grid Protection and Control, Nanjing 211106, China
  4. NARI Group Corporation, Nanjing 211106, China
  • Online: 2023-04-01 Published: 2023-04-01

Abstract: Multi-label classification of medical dispute judgment documents underpins their efficient retrieval and management, but its performance is directly degraded by class imbalance and label co-occurrence in medical dispute datasets. This paper therefore proposes a text multi-label classification scheme that combines resampling with ensemble learning. The scheme has two parts. First, a resampling algorithm based on the average sparsity of label sets reduces the influence of label co-occurrence on resampling and thereby alleviates the class imbalance of the dataset. Second, an ensemble-learning-based multi-label classification algorithm trains multiple base classifiers on the datasets obtained by resampling and combines them with a one-vote-veto voting strategy, further improving multi-label classification performance. Experimental results show that the proposed scheme is suitable not only for medical dispute judgment documents but also for other text datasets that exhibit class imbalance and label co-occurrence.

Key words: class imbalance, multi-label classification, ensemble learning, resampling algorithm, label co-occurrence
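
The one-vote-veto voting strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the rule, so the code below assumes one plausible reading: a label is assigned to a sample only if every base classifier predicts it, so a single negative vote vetoes the label. The function name `veto_vote` and the nested-list layout of the predictions are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List, Sequence


def veto_vote(predictions: Sequence[Sequence[Sequence[int]]]) -> List[List[int]]:
    """Combine base-classifier outputs with an assumed one-vote-veto rule.

    predictions[k][i][j] is 1 if base classifier k assigns label j to
    sample i, and 0 otherwise. A label is kept only if no classifier
    rejects it (logical AND across classifiers).
    """
    if not predictions:
        raise ValueError("at least one base classifier is required")
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        # Collect the label vector each base classifier produced for sample i.
        rows = [clf[i] for clf in predictions]
        # zip(*rows) groups the votes per label; one 0 vetoes the label.
        combined.append([int(all(votes)) for votes in zip(*rows)])
    return combined


# Three hypothetical base classifiers, two samples, three labels.
base_outputs = [
    [[1, 0, 1], [1, 1, 0]],
    [[1, 1, 1], [0, 1, 0]],
    [[1, 0, 1], [1, 1, 1]],
]
print(veto_vote(base_outputs))  # [[1, 0, 1], [0, 1, 0]]
```

Under this reading the ensemble favors precision: a label survives only when all base classifiers, each trained on a differently resampled dataset, agree on it.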