Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (3): 621-636. DOI: 10.3778/j.issn.1673-9418.2109014

• Artificial Intelligence •

Abstractive Text Summarization Model with Coherence Reinforcement and No Ground Truth Dependency

CHEN Gongchi1, RONG Huan1,+, MA Tinghuai2

  1. School of Artificial Intelligence (School of Future Technology), Nanjing University of Information Science & Technology, Nanjing 210044, China
    2. School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
  • Received: 2021-09-06 Revised: 2021-11-22 Online: 2022-03-01 Published: 2021-11-30
  • About author: CHEN Gongchi, born in 2000. His research interests include natural language processing, text summarization, etc.
    RONG Huan, born in 1990, Ph.D., lecturer. His research interests include social media mining, content security on social networks, knowledge engineering, etc.
    MA Tinghuai, born in 1974, Ph.D., professor. His research interests include social network privacy protection, big data mining, text emotion computing, etc.
  • Supported by:
    National Natural Science Foundation of China (62102187); Natural Science Foundation of Jiangsu Province (Basic Research Program) (BK20210639); Provincial College Students Innovation and Entrepreneurship Training Program of Jiangsu Province in 2021 (202110300093Y); National Key Research and Development Program of China (2021YFE0104400)

Coherence-Reinforced Text Summarization Model without Ground Truth Dependency

陈共驰1, 荣欢1,+, 马廷淮2

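  1. School of Artificial Intelligence (School of Future Technology), Nanjing University of Information Science & Technology, Nanjing 210044
    2. School of Computer Science (School of Software, School of Cyberspace Security), Nanjing University of Information Science & Technology, Nanjing 210044
  • Corresponding author: + E-mail: ronghuan@nuist.edu.cn
  • About author: CHEN Gongchi (陈共驰), born in 2000, male, from Zigong, Sichuan. His research interests include natural language processing, text summarization, etc.
    RONG Huan (荣欢), born in 1990, male, from Nanjing, Jiangsu, Ph.D., lecturer. His research interests include social media mining, content security on social networks, knowledge engineering, etc.
    MA Tinghuai (马廷淮), born in 1974, male, from Chongqing, Ph.D., professor. His research interests include social network privacy protection, big data mining, text emotion computing, etc.
  • Supported by:
    National Natural Science Foundation of China (62102187); Natural Science Foundation of Jiangsu Province (Basic Research Program) (BK20210639); Provincial College Students Innovation and Entrepreneurship Training Program of Jiangsu Province in 2021 (202110300093Y); National Key Research and Development Program of China (2021YFE0104400)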

Abstract:

Automatic text summarization aims to condense a given document into a short summary that faithfully reflects its main idea. Abstractive summarization has become a research hotspot in the field because it can paraphrase the source document with a flexible and rich vocabulary. However, existing abstractive models reorganize the original words and add new ones when generating a summary, which easily leads to incoherent sentences and low readability. In addition, improving the coherence of summary sentences through traditional supervised learning on labeled data incurs a high data cost, which limits practical application. Therefore, this paper proposes an abstractive text summarization model with coherence reinforcement and no ground truth dependency (ATS_CG). Given only the source document, on the one hand, the model generates extractive labels from the embedding of the source document to describe how key information is filtered, and the decoder then decodes the filtered sentence embeddings. On the other hand, based on the original word probability distribution output by the decoder, the model generates two types of summary, one by “probability selection” and one by “Softmax-greedy selection”. It then computes an overall reward for each of the two summaries from two aspects, coherence and content, and uses the self-critical policy gradient to learn to filter key sentences and decode them, so as to generate abstractive summaries with high coherence and content quality. Experiments show that, even without any ground truth, ATS_CG is on the whole superior to existing text summarization methods in terms of content evaluation scores. The abstractive summaries generated by ATS_CG are also better than those of existing methods in coherence, relevance, redundancy, novelty and perplexity.
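
The training signal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal weighting of the coherence and content rewards, and the toy probability table are all assumptions made for illustration, and the reward scores are plain numbers standing in for whatever the model actually computes. The sketch only shows how a sampled summary ("probability selection") is compared against a greedy baseline ("Softmax-greedy selection") in a generic self-critical policy-gradient loss.

```python
import math
import random


def sample_selection(step_probs, rng):
    """'Probability selection': sample one token index per decoding step."""
    return [rng.choices(range(len(p)), weights=p, k=1)[0] for p in step_probs]


def greedy_selection(step_probs):
    """'Softmax-greedy selection': take the most probable token at each step."""
    return [max(range(len(p)), key=p.__getitem__) for p in step_probs]


def total_reward(coherence, content, weight=0.5):
    """Hypothetical mix of the two reward aspects; the weight is an assumption."""
    return weight * coherence + (1.0 - weight) * content


def self_critical_loss(step_probs, sampled, reward_sampled, reward_greedy):
    """Generic self-critical loss: the greedy summary's reward is the baseline.

    Minimizing the loss raises the log-probability of the sampled tokens
    when the sampled summary earns more reward than the greedy one.
    """
    log_prob = sum(math.log(p[t]) for p, t in zip(step_probs, sampled))
    return -(reward_sampled - reward_greedy) * log_prob


# Toy decoder output: two decoding steps over a three-word vocabulary.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
sampled = sample_selection(step_probs, random.Random(0))
greedy = greedy_selection(step_probs)
loss = self_critical_loss(
    step_probs,
    sampled,
    reward_sampled=total_reward(coherence=0.8, content=0.6),
    reward_greedy=total_reward(coherence=0.5, content=0.5),
)
```

Because the baseline is the model's own greedy output, no reference summary enters the loss, which is consistent with the "no ground truth dependency" setting described above.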

Key words: automatic text summarization, natural language processing, reinforcement learning, information retrieval and integration

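Abstract (Chinese version):

Automatic text summarization aims to condense a given text so that a short summary effectively reflects the core content of the original. At present, abstractive text summarization has become a research hotspot in the field because it can restate the original text with more flexible and richer vocabulary. However, existing abstractive models recombine original words and add new words when producing summary sentences, which easily makes the sentences incoherent and hard to read. Moreover, improving sentence coherence through traditional supervised training on labeled data requires a high data cost, which restricts practical application. To this end, an abstractive text summarization model with coherence reinforcement and no ground truth dependency (ATS_CG) is proposed. Under the constraint that only the source text is given, the model, on the one hand, produces sentence extraction labels from the encoding of the source text to characterize the filtering of key information, and the decoder decodes the encodings of the filtered sentences; on the other hand, based on the original word distribution output by the decoder, it produces two types of summary text by “probability selection” and by “Softmax-greedy selection”. After the overall reward of the two types of summary text is built from both sentence coherence and sentence content, the self-critical policy gradient guides the model to learn key sentence filtering and the decoding of the filtered key sentences, generating summaries with high sentence coherence and good content quality. Experiments show that, even without any pre-labeled ground-truth summary, the content metrics of the proposed model are on the whole still better than those of existing text summarization methods; meanwhile, the summaries generated by ATS_CG are also better than those of existing methods in sentence coherence, content importance, information redundancy, lexical novelty and summary perplexity.

Key words (Chinese version): automatic text summarization, natural language processing, reinforcement learning, information retrieval and integration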

CLC Number: