Journal of Frontiers of Computer Science and Technology ›› 2015, Vol. 9 ›› Issue (2): 242-248. DOI: 10.3778/j.issn.1673-9418.1407006

• Artificial Intelligence and Pattern Recognition •

Multi-Document Summarization Algorithm Based on Significance Topic of LDA

LIU Na+, LU Ying, TANG Xiaojun, LI Mingxia   

  1. School of Information Science & Engineering, Dalian Polytechnic University, Dalian, Liaoning 116034, China
  • Published: 2015-02-03

Abstract: This paper proposes a multi-document summarization algorithm based on the significant topics of an LDA (latent Dirichlet allocation) model. The algorithm differs from existing topic-model-based multi-document summarization algorithms in two ways. First, it introduces and defines the concept of topic significance: the topics built by the LDA model are divided into significant and insignificant topics, and the similarity between a sentence's topics and the document's significant topics is emphasized when the sentence weight is calculated. Second, in addition to LDA topic features, it uses statistical features such as term frequency, sentence position, and sentence length, combining them into a single feature vector from which the sentence weight is computed. The algorithm thus retains the advantages of traditional statistical features while incorporating the topic concept of the LDA model. Experiments on the DUC2002 corpus show that the proposed algorithm achieves better summarization performance than the other state-of-the-art algorithms compared.

Key words: multi-document summarization, topic model, significance topic
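
The sentence-weighting idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the definition of topic significance, the similarity measure, or the feature weights, so the mean-based threshold in `significant_topics`, the masked cosine similarity, the linear combination, and the default `weights` are all assumptions made for this example. The topic distributions are assumed to come from an already trained LDA model, and the sentence-length feature is omitted for brevity.

```python
import math
from collections import Counter


def significant_topics(doc_topic, ratio=1.0):
    """Return indices of topics treated as significant.

    Assumption: a topic is significant when its document-level weight is at
    least `ratio` times the mean weight. The paper's definition may differ.
    """
    mean = sum(doc_topic) / len(doc_topic)
    return [k for k, w in enumerate(doc_topic) if w >= ratio * mean]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_sentences(sentences, sent_topics, doc_topic, weights=(0.4, 0.2, 0.4)):
    """Score sentences from term frequency, position, and topic similarity.

    sentences   : list of token lists, in document order
    sent_topics : per-sentence topic distributions (hypothetical LDA output)
    doc_topic   : document-level topic distribution (hypothetical LDA output)
    weights     : illustrative weights for (term frequency, position, topic)
    """
    keep = set(significant_topics(doc_topic))
    # Zero out insignificant topics so only significant ones drive similarity.
    masked_doc = [w if k in keep else 0.0 for k, w in enumerate(doc_topic)]

    tf = Counter(tok for tokens in sentences for tok in tokens)
    max_tf = max(tf.values())
    n = len(sentences)
    w_tf, w_pos, w_topic = weights

    scores = []
    for i, (tokens, theta) in enumerate(zip(sentences, sent_topics)):
        # Average normalized term frequency of the sentence's tokens.
        f_tf = sum(tf[t] for t in tokens) / (len(tokens) * max_tf) if tokens else 0.0
        # Earlier sentences receive a higher position score.
        f_pos = 1.0 - i / n
        # Similarity between the sentence's topics and the significant topics.
        f_topic = cosine(theta, masked_doc)
        scores.append(w_tf * f_tf + w_pos * f_pos + w_topic * f_topic)
    return scores


if __name__ == "__main__":
    doc_topic = [0.5, 0.3, 0.1, 0.1]
    sentences = [["lda", "topic"], ["topic", "model"], ["weather"]]
    sent_topics = [
        [0.6, 0.3, 0.05, 0.05],
        [0.1, 0.1, 0.4, 0.4],
        [0.05, 0.05, 0.1, 0.8],
    ]
    for tokens, s in zip(sentences, score_sentences(sentences, sent_topics, doc_topic)):
        print(tokens, round(s, 3))
```

A summary would then be assembled by selecting the highest-scoring sentences up to a length limit. Masking the document's topic vector, rather than comparing only the significant components, keeps a sentence dominated by insignificant topics from receiving a high similarity score.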