Journal of Frontiers of Computer Science and Technology, 2017, Vol. 11, Issue (4): 608-618. DOI: 10.3778/j.issn.1673-9418.1604029

Research on Multi-Feature Sentence Similarity Computing Method with Word Embedding

LI Feng1,2+, HOU Jiaying3, ZENG Rongren1, LING Chen1   

  1. Logistics Science Research Institute of PLA, Beijing 100166, China
  2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
  3. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China
  • Online: 2017-04-12  Published: 2017-04-12


李  峰1,2+,侯加英3,曾荣仁1,凌  晨1   

  1. 中国人民解放军后勤科学研究所,北京 100166
  2. 北京航空航天大学 计算机学院,北京 100191
  3. 昆明理工大学 信息工程与自动化学院,昆明 650504

Abstract: After summarizing common sentence similarity computing methods, this paper trains a word vector model for semantic similarity computing on more than 34,000 texts from People's Daily. Based on the trained model, it designs a multi-feature sentence similarity computing method that considers both word features and sentence structure features. At the word level, the method accounts for the number of overlapping words and for word continuity, and uses the word vector model to measure the semantic similarity of non-overlapping words. At the structural level, it considers the order of the overlapping words and the length consistency of the two sentences. Finally, four sentence similarity computing methods are designed and implemented, and an experimental system is developed. The experimental results show that the proposed method achieves relatively good results, and that combining and optimizing word-level semantic features with sentence structure features improves the accuracy of sentence similarity computing.

Key words: word embedding, sentence similarity, Word2vec, algorithm design
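
The five features named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not give the formulas or weights, so every function name, every normalization, the equal weights, and the toy two-dimensional vectors below are assumptions chosen only for illustration. In practice the `vectors` mapping would come from a trained word vector model such as Word2vec.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def overlap_ratio(s1, s2):
    """Assumed feature: Dice-style share of distinct words common to both sentences."""
    a, b = set(s1), set(s2)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def continuity(s1, s2):
    """Assumed feature: longest common contiguous word run over the shorter length."""
    if not s1 or not s2:
        return 0.0
    best, prev = 0, [0] * (len(s2) + 1)
    for w1 in s1:
        cur = [0] * (len(s2) + 1)
        for j, w2 in enumerate(s2, start=1):
            if w1 == w2:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best / min(len(s1), len(s2))


def non_overlap_semantic(s1, s2, vectors):
    """Assumed feature: mean best cosine match among words found in only one sentence."""
    only1 = [w for w in s1 if w not in s2]
    only2 = [w for w in s2 if w not in s1]
    if not only1 and not only2:
        return 1.0  # every word is shared, nothing left to compare
    only1 = [w for w in only1 if w in vectors]
    only2 = [w for w in only2 if w in vectors]
    if not only1 or not only2:
        return 0.0
    best1 = [max(cosine(vectors[a], vectors[b]) for b in only2) for a in only1]
    best2 = [max(cosine(vectors[a], vectors[b]) for a in only1) for b in only2]
    return max(0.0, (sum(best1) + sum(best2)) / (len(best1) + len(best2)))


def order_similarity(s1, s2):
    """Assumed feature: share of shared-word pairs kept in the same relative order."""
    shared = [w for w in dict.fromkeys(s1) if w in s2]
    if len(shared) < 2:
        return 1.0 if shared else 0.0
    pos2 = {w: s2.index(w) for w in shared}
    pairs = list(combinations(shared, 2))  # each pair is ordered as in s1
    return sum(1 for a, b in pairs if pos2[a] < pos2[b]) / len(pairs)


def length_conformity(s1, s2):
    """Assumed feature: 1 minus the normalized length difference."""
    total = len(s1) + len(s2)
    return 1 - abs(len(s1) - len(s2)) / total if total else 0.0


def sentence_similarity(s1, s2, vectors, weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Weighted sum of the five features; the equal weights are hypothetical."""
    features = (
        overlap_ratio(s1, s2),
        continuity(s1, s2),
        non_overlap_semantic(s1, s2, vectors),
        order_similarity(s1, s2),
        length_conformity(s1, s2),
    )
    return sum(w * f for w, f in zip(weights, features))


# Toy vectors standing in for a trained word vector model (illustrative only).
toy_vectors = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "car": [0.0, 1.0]}
s_cat = ["the", "cat", "sleeps"]
s_kitten = ["the", "kitten", "sleeps"]
s_car = ["the", "car", "sleeps"]
print(sentence_similarity(s_cat, s_kitten, toy_vectors))
print(sentence_similarity(s_cat, s_car, toy_vectors))
```

In this toy setup the three sentences share the same overlapping words, word order, and length, so only the word vector feature separates them: "cat" scores closer to "kitten" than to "car". That is the role the abstract assigns to the word vector model, namely comparing the words that the overlap-based features cannot match.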
