Journal of Frontiers of Computer Science and Technology

• Science Researches •

A Survey of Deep Learning-Based Extractive Summarization

TIAN Xuan, LI Jialiang, MENG Xiaohuan   

  1. School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China
  2. Engineering Research Center for Forestry-oriented Intelligent Information Processing of National Forestry and Grassland Administration, Beijing 100083, China

基于深度学习的抽取式摘要研究综述

田萱,李嘉梁,孟晓欢   

  1. 北京林业大学 信息学院,北京100083
  2. 国家林业草原林业智能信息处理工程技术研究中心,北京100083

Abstract: Automatic text summarization (ATS) is a popular research direction in natural language processing, and its main approaches fall into two categories: extractive and abstractive. Extractive summarization directly reuses text from the source document and, compared with abstractive summarization, offers higher grammatical and factual correctness. It therefore has broad application prospects in domains that demand rigor, such as policy interpretation, official document summarization, law, and medicine. In recent years, extractive summarization based on deep learning has received extensive attention. This article reviews recent research progress in deep learning-based extractive summarization and analyzes related work on its two key steps: text unit encoding and summary extraction. First, according to the model framework, text unit encoding methods are divided into four categories: hierarchical sequential encoding, encoding based on graph neural networks, fusion encoding, and pre-training-based encoding. Then, according to the extraction granularity, summary extraction methods are divided into two categories: text unit-level extraction and summary-level extraction. The article also introduces commonly used public datasets and performance evaluation metrics for extractive summarization. Finally, possible future research directions and the corresponding development trends in this field are predicted and summarized.

Key words: deep learning, extractive summarization, text unit encoding, summary extraction
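
The two key steps named in the abstract, text unit encoding and summary extraction, and the difference between text unit-level and summary-level extraction, can be illustrated with a minimal sketch. This is not a method from the surveyed work: the encoder here is a plain bag-of-words count vector standing in for a neural encoder, sentences are assumed to be the text units, and all function names (`split_sentences`, `encode`, `unit_level_extract`, `summary_level_extract`) are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Split text into sentences (the assumed 'text units') after ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def encode(unit):
    """Stand-in encoder: bag-of-words counts, NOT a deep learning model."""
    return Counter(re.findall(r"[a-z0-9]+", unit.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def unit_level_extract(text, k=2):
    """Score each sentence independently against the document; keep the top k."""
    sentences = split_sentences(text)
    doc_vec = encode(text)
    scores = [cosine(encode(s), doc_vec) for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # restore document order


def summary_level_extract(text, k=2):
    """Score each k-sentence candidate summary as a whole; keep the best one."""
    sentences = split_sentences(text)
    doc_vec = encode(text)
    k = min(k, len(sentences))
    best = max(
        combinations(range(len(sentences)), k),
        key=lambda idx: cosine(encode(" ".join(sentences[i] for i in idx)), doc_vec),
    )
    return [sentences[i] for i in best]


if __name__ == "__main__":
    doc = ("Cats sleep a lot. Cats hunt mice at night. "
           "The weather was cold. Cats and mice sleep at night.")
    print(unit_level_extract(doc, k=2))
    print(summary_level_extract(doc, k=2))
```

The summary-level variant enumerates every candidate combination, which is only practical for short documents; it is shown here purely to contrast scoring a summary as a whole with scoring units one at a time.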

摘要: 自动文本摘要是自然语言处理的热门研究方向,主要实现方法分为抽取式和生成式两类。抽取式摘要直接采用源文档中的文字内容,相比生成式摘要具有更高的语法正确性和事实正确性,在政策解读、官方文件总结、法律和医药等要求较为严谨的领域具有广泛应用前景。近几年基于深度学习的抽取式摘要研究受到广泛关注。主要梳理了近几年基于深度学习的抽取式摘要技术研究进展;针对抽取式摘要的两个关键步骤——文本单元编码和摘要抽取分别来梳理分析相关研究工作。首先根据模型框架不同,将文本单元编码方法分为层级序列编码、基于图神经网络的编码、融合式编码和基于预训练的编码等四类介绍;然后根据摘要抽取阶段抽取粒度的不同,将摘要抽取方法分为文本单元级抽取和摘要级抽取两类分析。并介绍了抽取式摘要任务常用公共数据集和性能评估指标。最后,预测并分析总结了该领域未来可能的研究方向及相应的发展趋势。

关键词: 深度学习, 抽取式摘要, 文本单元编码, 摘要抽取