Journal of Frontiers of Computer Science and Technology ›› 2020, Vol. 14 ›› Issue (2): 260-273. DOI: 10.3778/j.issn.1673-9418.1901069

• Artificial Intelligence •

Application of Multi-Layered Gradient Boosting Decision Trees in Pharmaceutical Classification
(Original Chinese title: 多层梯度提升树在药品鉴别中的应用)

DU Shishuai (杜师帅), QIU Tian (邱天), LI Lingqiao (李灵巧), HU Jinquan (胡锦泉), ZHENG Anbing (郑安兵), FENG Yanchun (冯艳春), HU Changqin (胡昌勤), YANG Huihua (杨辉华)

  1. School of Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. School of Optoelectronics, Beijing Institute of Technology, Beijing 100081, China
  3. College of Electronic Engineering and Automation, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
  4. National Institutes for Food and Drug Control, Beijing 100050, China
  • Online: 2020-02-01 Published: 2020-02-16

Abstract:

Near-infrared (NIR) spectroscopy is widely and effectively used in pharmaceutical analysis. On small, high-dimensional, nonlinear NIR data sets, traditional drug identification algorithms have limited feature-learning ability, neural-network-based methods suffer from local optima and overfitting, and both tend to ignore class imbalance among samples. To address these shortcomings, a pharmaceutical classification method based on multi-layered gradient boosting decision trees with feature selection and cost-sensitive learning (CS_FGBDT) is proposed. First, the raw spectra are preprocessed with Savitzky-Golay smoothing and a first derivative. Second, a random forest adaptively selects features from the preprocessed spectra, and multi-layered gradient boosting decision trees perform the feature mapping. Finally, a cost-sensitive learning mechanism is incorporated to minimize the negative effect of sample imbalance. In comparative experiments on two imbalanced data sets, capsules and tablets, the model achieves higher prediction accuracy and stability than the compared algorithms, making it an effective method for drug identification.

Key words: near-infrared spectroscopy, adaptive feature selection, multi-layered gradient boosting decision trees, cost-sensitive learning
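As a rough illustration of two steps named in the abstract, the sketch below applies Savitzky-Golay smoothing followed by a first derivative to one spectrum, and derives inverse-frequency class weights as one simple form of cost-sensitive weighting. It is not the authors' implementation. The 5-point quadratic window, the function names, and the weighting rule are all assumptions made for illustration; the abstract does not state the window length, polynomial order, or cost matrix used in CS_FGBDT, and the random forest and multi-layered boosting stages are not shown.

```python
from collections import Counter

# Standard 5-point, quadratic-fit Savitzky-Golay coefficients.
# ASSUMPTION: the window length and polynomial order are illustrative only.
SMOOTH_COEFFS = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]
DERIV_COEFFS = [-2 / 10, -1 / 10, 0.0, 1 / 10, 2 / 10]  # per unit spacing


def convolve_valid(signal, coeffs):
    """Apply a centered filter, keeping only points with a full window."""
    half = len(coeffs) // 2
    return [
        sum(c * signal[i + k - half] for k, c in enumerate(coeffs))
        for i in range(half, len(signal) - half)
    ]


def preprocess(spectrum):
    """Smooth the spectrum, then take its first derivative.

    Each pass trims two points from both ends, so the output is
    eight points shorter than the input.
    """
    smoothed = convolve_valid(spectrum, SMOOTH_COEFFS)
    return convolve_valid(smoothed, DERIV_COEFFS)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    ASSUMPTION: a common inverse-frequency heuristic, used here as a
    stand-in for the paper's cost-sensitive mechanism.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


if __name__ == "__main__":
    toy_spectrum = [float(i * i) for i in range(12)]  # hypothetical data
    print(preprocess(toy_spectrum))
    print(balanced_class_weights(["capsule_A"] * 3 + ["capsule_B"]))
```

Per-class weights of this kind can then be expanded to per-sample weights and passed to any learner that accepts them, so that misclassifying a minority-class sample costs more during training.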