Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2024, Vol. 18 ›› Issue (9): 2337-2348. DOI: 10.3778/j.issn.1673-9418.2406041

• Special Topic on Construction and Application of Vertical-Domain Large Models •

Application of Generative Large Language Models in Chinese Radiology Domain

CHEN Longfei (陈龙飞), GAO Xin (高鑫), HOU Haotian (侯皓天), YE Chuyang (叶初阳), LIU Ya'ou (刘亚欧), ZHANG Meihui (张美慧)

  1. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
  2. School of Integrated Circuits and Electronics, Beijing Institute of Technology, Beijing 100081, China
  3. Department of Radiology, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China
  • Online: 2024-09-01 Published: 2024-09-01

Abstract: In Chinese radiology, imaging reports are an important basis for clinical decision-making. Using natural language processing (NLP) to understand and learn from report text, and thereby support clinical radiology work, has therefore become an important research direction. However, traditional methods for classification and generation tasks on Chinese radiology reports are limited by scarce, privacy-sensitive training corpora and poor model generalization, which leads to insufficient overall performance. To address these issues, this paper proposes a solution for natural language tasks in the Chinese radiology domain based on local, efficient fine-tuning of a large language model. A large-scale, high-quality dataset of natural language tasks is collected and constructed from Chinese radiology reports, and the open-source large language model Baichuan2 is trained with supervised fine-tuning using the LoRA efficient fine-tuning method. The resulting model, “RadGPT” (龙影大模型), handles four clinical tasks in the Chinese radiology domain simultaneously. An evaluation system for natural language classification and generation tasks in the Chinese radiology domain is also proposed. Multiple groups of experiments are conducted on report datasets covering three types of medical imaging from two centers, with comparisons against several typical existing methods. The results show that the proposed method performs better in classification performance, text summarization and expansion, and model generalization.

Key words: large language model, radiology report, text classification, text generation, efficient fine-tuning strategy
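
The abstract names LoRA as the efficient fine-tuning method but gives no implementation detail. The following is only a minimal sketch of the general LoRA idea (a frozen weight matrix plus a trainable low-rank update), written in plain Python. The class name `LoRALinear`, the helper `matmul`, and the rank and alpha values are hypothetical choices for illustration; they are not taken from the paper and do not reflect the authors' actual configuration for Baichuan2.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a low-rank update (alpha / rank) * B @ A.

    Illustrative only: names and hyperparameters are assumptions, not the paper's.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # pretrained weight, kept frozen
        self.scale = alpha / rank     # scaling applied to the low-rank update
        # A gets small random values and B starts at zero, so the adapted
        # layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to one input vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_parameters(self):
        """Count only the entries of A and B; W is not trained."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1, alpha=2)
    print(layer.forward([1.0, 1.0]))
    print(layer.trainable_parameters())
```

In this sketch only `A` and `B` would be updated during training, which is why the number of trainable parameters grows with the rank rather than with the full size of the weight matrix.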