计算机科学与探索 (Journal of Frontiers of Computer Science and Technology) ›› 2024, Vol. 18 ›› Issue (11): 2823-2847. DOI: 10.3778/j.issn.1673-9418.2308100

• Frontiers·Surveys •

Survey of Deep Learning Based Extractive Summarization (基于深度学习的抽取式摘要研究综述)

TIAN Xuan (田萱), LI Jialiang (李嘉梁), MENG Xiaohuan (孟晓欢)

  1. School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China
  2. Engineering Research Center for Forestry-Oriented Intelligent Information Processing of National Forestry and Grassland Administration, Beijing 100083, China
  • Online: 2024-11-01 Published: 2024-10-31

Abstract: Automatic text summarization (ATS) is an active research area in natural language processing, and its methods fall into two main categories: extractive and abstractive. Extractive summarization takes its text directly from the source document, so it achieves higher grammatical and factual correctness than abstractive summarization, and it has broad application prospects in domains that demand rigor, such as policy interpretation, official document summarization, law, and medicine. Extractive summarization based on deep learning has received extensive attention in recent years. This paper reviews recent research progress in deep-learning-based extractive summarization and analyzes related work on its two key steps: text unit encoding and summary extraction. First, according to model framework, text unit encoding methods are divided into four categories: hierarchical sequential encoding, graph-neural-network-based encoding, fusion encoding, and pre-training-based encoding. Second, according to extraction granularity, summary extraction methods are divided into two categories: text-unit-level extraction and summary-level extraction. The paper then introduces the public datasets and performance evaluation metrics commonly used for extractive summarization. Finally, it discusses possible future research directions and the corresponding development trends in this field.

Key words: deep learning, extractive summarization, text unit encoding, summary extraction