Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (2): 294-304. DOI: 10.3778/j.issn.1673-9418.2003022

• Artificial Intelligence •

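Keyphrase Extraction Algorithm Combining Word and Document Embeddings

ZU Xian (祖弦), XIE Fei (谢飞), LIU Xiaojian (刘啸剑)

  1. School of Computer Science, Hefei Normal University, Hefei 230061
  2. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009
  • Publication date: 2021-02-01 Release date: 2021-02-01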

Keyphrase Extraction Combining Word and Document Embeddings

ZU Xian, XIE Fei, LIU Xiaojian   

  1. School of Computer Science, Hefei Normal University, Hefei 230061, China
  2. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
  • Online: 2021-02-01 Published: 2021-02-01

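Abstract:

Text data in all kinds of application domains keep growing, and quickly and accurately extracting the core content from this mass of data has become the main task of keyphrase extraction. This paper proposes a keyphrase extraction method based on word and document embeddings: by computing vector representations of words and the document in the same dimensional space, it obtains the semantic similarity between each word and the document and uses it as the initial weight of each word node in an undirected graph. A semantically biased random walk strategy is then used to compute a score for each word and each candidate phrase. Finally, the top N highest-scoring candidates are selected as the final keyphrases. Experimental results on public datasets show that the algorithm exceeds existing mainstream keyphrase extraction methods in precision, recall, and F-measure, greatly improving the efficiency of automatic keyphrase extraction.

Keywords: keyphrase extraction, graph ranking, word embedding, document embedding, semantic information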

Abstract:

As text data accumulate across application fields, quickly and accurately extracting the main information has become the central task of keyphrase extraction. This paper proposes a keyphrase extraction method based on word and document embeddings. Words and documents are represented as vectors in the same vector space, and the semantic similarity between each word and the document is used as the initial weight of the corresponding word node in an undirected graph. A semantically biased random walk then computes a score for each word and each candidate phrase. Finally, the N highest-scoring candidate phrases are selected as the final keyphrases. Experimental results on public datasets show that the proposed algorithm outperforms state-of-the-art keyphrase extraction methods in precision, recall, and F-measure, and that it can greatly improve the efficiency of automatic keyphrase extraction.

Key words: keyphrase extraction, graph sorting, word embedding, document embedding, semantic information
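
The pipeline described in the abstract (word-document similarity as initial node weights, a semantically biased random walk, then top-N selection) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract gives no formulas, so the co-occurrence window, the damping factor, the PageRank-style update, and the sum-of-word-scores phrase scoring are all assumptions, as are the function names `biased_random_walk` and `top_n_phrases`. The random toy vectors and the mean-of-words document vector are stand-ins for trained word and document embeddings.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def biased_random_walk(tokens, word_vecs, doc_vec, window=2, damping=0.85,
                       iters=100, tol=1e-8):
    """Score words with a PageRank-style walk whose restart distribution
    is proportional to each word's similarity to the document (assumed form)."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)

    # Undirected co-occurrence graph: link words at most `window` tokens apart.
    adj = np.zeros((n, n))
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + 1 + window]:
            if v != w:
                adj[index[w], index[v]] += 1.0
                adj[index[v], index[w]] += 1.0

    # Semantic bias: word-document similarity, clipped at 0 and normalized.
    bias = np.array([max(cosine(word_vecs[w], doc_vec), 0.0) for w in vocab])
    bias = bias / bias.sum() if bias.sum() > 0 else np.full(n, 1.0 / n)

    # Column-stochastic transition matrix; isolated nodes jump to the bias.
    col_sums = adj.sum(axis=0)
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    trans = np.where(col_sums > 0, adj / safe_sums, bias[:, None])

    # Power iteration starting from the semantic weights.
    scores = bias.copy()
    for _ in range(iters):
        new = (1.0 - damping) * bias + damping * (trans @ scores)
        converged = np.abs(new - scores).sum() < tol
        scores = new
        if converged:
            break
    return dict(zip(vocab, scores))


def top_n_phrases(candidates, word_scores, n=3):
    """Rank candidate phrases by the sum of their word scores (assumed rule)."""
    scored = {p: sum(word_scores.get(w, 0.0) for w in p.split())
              for p in candidates}
    return sorted(scored, key=scored.get, reverse=True)[:n]


# Toy input: content words only; real use would tokenize and filter a document.
tokens = ("keyphrase extraction graph ranking word embedding document "
          "embedding semantic similarity keyphrase extraction").split()
candidates = ["keyphrase extraction", "word embedding", "document embedding",
              "graph ranking", "semantic similarity"]

# Stand-in embeddings: random non-negative vectors, document = mean of words.
rng = np.random.default_rng(0)
word_vecs = {w: rng.random(8) for w in set(tokens)}
doc_vec = np.mean([word_vecs[w] for w in tokens], axis=0)

word_scores = biased_random_walk(tokens, word_vecs, doc_vec)
keyphrases = top_n_phrases(candidates, word_scores, n=3)
print(keyphrases)
```

Because the restart distribution and every column of the transition matrix sum to one, the word scores remain a probability distribution after each update. Summing word scores favors longer phrases, so a length-normalized score would be a reasonable alternative.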