Journal of Frontiers of Computer Science and Technology ›› 2018, Vol. 12 ›› Issue (11): 1718-1728.DOI: 10.3778/j.issn.1673-9418.1709030

Previous Articles     Next Articles

Method for Document Query under Word Mover's Distance Metric

WANG Weidi, CHEN Ke, HU Tianlei, CHEN Gang, SHOU Lidan   

  1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
  • Online:2018-11-01 Published:2018-11-12

一种单词移动距离度量下的文档查询方法研究

王伟迪陈珂胡天磊陈刚寿黎但   

  1. 浙江大学 计算机科学与技术学院,杭州 310027

Abstract:

Recently, a good metric, word mover's distance (WMD) has been proposed to measure document similarity, which combines the Word2Vec model and computes the distance between two documents by earth mover's distance (EMD). However, it has two drawbacks. Firstly, it takes imprecise term frequency as word weight. Secondly, the existing query method is extremely inefficient under WMD metric. In order to increase the accuracy of document applications when using WMD, TF-IDF (term frequency-inverse document frequency) is taken as word weight and an improved WMD(TI-WMD) is achieved. To improve the efficiency of document query under WMD metric, this paper proposes an approximate hierarchical query mehod(HQ). First of all, according to word centroid vector of every document, it constructs the indexes for the document data set by locality-sensitive hashing(LSH). In the query process, it achieves a candidate document set according to the word centroid of the query document and multi-probe LSH query method. And then, on the basis of the document signature and the filter-and-refinement frame, it can obtain the approximate k nearest neighbor under TI-WMD metric. The empirical study over Reuters-21578 and 20-Newsgroups data sets demonstrates the superior accuracy and efficiency of TI-WMD metric and the hierarchical query method, compared with WMD and PrefetchPrune under WMD metric.

Key words: word mover's distance, earth mover??s distance, locality-sensitive Hashing, approximate k nearest neighbor, hierarchical query

摘要:

单词移动距离(word mover??s distance,WMD)是最近提出的一种有效的文档相似性度量方式,其融合了Word2Vec词向量表达的语义信息,并依据推土机距离(earth mover??s distance,EMD)计算文档间的距离。然而,单词移动距离存在两个缺陷:第一点是它采用不够精确的词频来作为单词的权重;第二点是单词移动距离度量下的查询效率很低。为了改善应用单词移动距离时的效果,考虑到单词的重要性而采用TF-IDF(term frequency-inverse document frequency)评分作为单词权重,进而得到一种改进的单词移动距离(TI-WMD)。为了提高单词移动距离度量下的文档查询效率,提出了一种近似的层次化查询方法。首先,依据文档的单词质心向量采用局部敏感哈希为文档集合构建哈希索引。在查询过程中,依据查询文档的单词质心向量和多探寻局部敏感哈希方法获得候选文档集,接着依据文档标签与过滤-细化框架在候选文档集中获得TI-WMD度量下的近似k近邻。在Reuters-21578和20-Newsgroups两个文档数据集上的实验结果表明,相对于WMD与PrefetchPrune方法,TI-WMD与层次化查询在准确性和效率上更具优势。

关键词: 单词移动距离, 推土机距离, 局部敏感哈希, 近似[k]近邻, 层次化查询