Journal of Frontiers of Computer Science and Technology, 2019, Vol. 13, Issue (9): 1481-1492. DOI: 10.3778/j.issn.1673-9418.1808005

Research on Technologies of Chinese Key-Phrase Automatic Extraction

RONG Chuitian, LI Yinyin, WANG Yan   

  1. School of Computer Science and Technology, Tianjin Polytechnic University, Tianjin 300387, China
  2. School of Computer and Information Engineering, Xiamen University of Technology, Xiamen, Fujian 361024, China
  • Online: 2019-09-01 Published: 2019-09-06

Abstract: SegPhrase is a state-of-the-art algorithm for key-phrase extraction and achieves higher precision and recall than traditional methods. However, it still has shortcomings in both phrase extraction and phrase quality evaluation. To improve extraction quality and extract Chinese key phrases effectively, this paper improves the SegPhrase algorithm. In the phrase generation phase, the mutual information between word strings is used to retain phrases that are low-frequency but important. In the phrase quality evaluation phase, different weights are assigned to different features so that each phrase is assessed comprehensively, and the phrases that best fit the application context are selected. Finally, to verify the quality of the extracted key phrases, they are applied to document topic analysis. Experiments show that the improved SegPhrase algorithm achieves higher recall and precision than the original method, and that topic analysis based on the extracted key phrases expresses the topic information of a document more clearly and accurately than topic analysis based on keywords.

Key words: key phrase extraction, text feature, mutual information, topic analysis
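
The two ideas in the abstract, keeping rare phrases by mutual information and scoring phrases with weighted features, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give their formula, so pointwise mutual information (PMI) over adjacent words is assumed here, and the function names, thresholds (`min_count`, `min_pmi`), feature names, and toy corpus are all hypothetical. The input is assumed to be already word-segmented.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return {(w1, w2): (count, pmi)} for every adjacent word pair.

    PMI is an assumed stand-in for the paper's mutual information feature.
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    stats = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        stats[(w1, w2)] = (count, math.log2(p_xy / (p_x * p_y)))
    return stats


def select_candidates(stats, min_count=3, min_pmi=4.0):
    """Keep frequent pairs, plus rare pairs whose PMI is high.

    Both thresholds are hypothetical values chosen for the toy corpus.
    """
    return {
        pair
        for pair, (count, pmi) in stats.items()
        if count >= min_count or pmi >= min_pmi
    }


def quality_score(features, weights):
    """Weighted sum of feature values; features without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Toy, pre-segmented corpus (hypothetical data for illustration only).
sentences = [
    ["我们", "研究", "机器", "学习"],
    ["我们", "使用", "机器", "学习"],
    ["我们", "改进", "机器", "学习"],
    ["我们", "研究", "互信息", "特征"],
]
stats = bigram_pmi(sentences)
selected = select_candidates(stats)
```

With these toy thresholds, the pair (机器, 学习) is kept because it occurs three times, while (互信息, 特征) occurs only once and is kept on its PMI alone. A pure frequency cutoff would have discarded it, which is the kind of low-frequency phrase the abstract aims to preserve.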
