Journal of Frontiers of Computer Science and Technology, 2021, Vol. 15, Issue (5): 907-921. DOI: 10.3778/j.issn.1673-9418.2006002

• Artificial Intelligence •

SFExt-PGAbs: Two-Stage Summarization Model for Long Document

ZHOU Weixiao (周伟枭), LAN Wenfei (蓝雯飞), XU Zhiming (许智明), ZHU Rongbo (朱容波)

  1. School of Computer Science, South-Central University for Nationalities, Wuhan 430074, China
  2. School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou 350108, China
  • Online: 2021-05-01 Published: 2021-04-30

Abstract:

To address the limited fluency of extractive methods, the limited accuracy of abstractive methods, and the loss of important information caused by truncating the original document before encoding, this paper proposes SFExt-PGAbs, a two-stage summarization model for long documents. It combines SFExt, a submodular-function extractive summarizer, with PGAbs, a pointer-generator abstractive summarizer. SFExt-PGAbs mimics the way a person summarizes a long document. First, SFExt extracts the important sentences from the long document and filters out unimportant and redundant ones to form a transitional document. Then, PGAbs takes the transitional document as input and generates a fluent and accurate summary. To obtain a transitional document that is closer to the central idea of the original document, this paper extends the traditional SFExt with two additional sub-aspects, positional importance and accuracy, and designs a new greedy algorithm. To study how different feature extractors affect the quality of the generated summary, two kinds of recurrent neural networks are applied in PGAbs. Experimental results on the CNNDM test set show that SFExt-PGAbs generates more fluent and more accurate summaries than the baseline model, with markedly higher ROUGE scores. The extended sub-aspects also allow SFExt on its own to extract more accurate summaries.

Key words: two-stage summarization model, long document summarization, extractive summarization, abstractive summarization, submodular function, pointer generator, sub-aspect fusion
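
The extraction stage described in the abstract can be illustrated with a minimal Python sketch of budgeted greedy selection over a monotone submodular objective. This is not the paper's SFExt objective or its new greedy algorithm, neither of which the abstract specifies. The objective here (number of distinct words covered plus a small bonus for early sentences) is an assumption chosen only because it is simple and provably submodular, and the names `greedy_extract`, `build_transitional_document`, and `position_weight` are hypothetical.

```python
import re
from typing import List, Set


def tokenize(sentence: str) -> Set[str]:
    """Lowercased set of alphanumeric tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def greedy_extract(sentences: List[str], budget: int,
                   position_weight: float = 0.5) -> List[int]:
    """Greedily pick sentence indices under a word budget.

    Assumed objective (not the paper's):
        f(S) = |distinct words covered by S| + sum of position_weight / (i + 1)
    Word coverage is monotone submodular and the position term is modular,
    so f is monotone submodular. Each step takes the feasible sentence with
    the largest marginal gain per word of cost.
    """
    words = [tokenize(s) for s in sentences]
    costs = [len(s.split()) for s in sentences]
    selected: List[int] = []
    covered: Set[str] = set()
    used = 0
    while True:
        best, best_ratio = None, 0.0
        for i in range(len(sentences)):
            if i in selected or costs[i] == 0 or used + costs[i] > budget:
                continue
            # Marginal gain: newly covered words plus the position bonus.
            gain = len(words[i] - covered) + position_weight / (i + 1)
            ratio = gain / costs[i]
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break
        selected.append(best)
        covered |= words[best]
        used += costs[best]
    return sorted(selected)  # restore original document order


def build_transitional_document(sentences: List[str], budget: int) -> str:
    """Join the selected sentences; a second-stage model would read this."""
    return " ".join(sentences[i] for i in greedy_extract(sentences, budget))


if __name__ == "__main__":
    doc = [
        "The city council approved the new transit budget on Monday.",
        "The council approved the budget.",
        "Critics say the plan ignores rural bus routes.",
    ]
    print(greedy_extract(doc, budget=18))  # [0, 2]
    print(build_transitional_document(doc, budget=18))
```

In this toy input the second sentence repeats words already covered by the first, so its marginal gain is small and the 18-word budget is spent on sentences 0 and 2 instead. In a two-stage setup, the returned text would play the role of the transitional document passed to the abstractive model.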
