计算机科学与探索 ›› 2018, Vol. 12 ›› Issue (1): 49-64.DOI: 10.3778/j.issn.1673-9418.1610044

• 数据库技术 • 上一篇    下一篇

面向不确定文本数据的余弦相似性查询方法

朱命冬1,2+,徐立新1,申德荣2,寇  月2,聂铁铮2   

  1. 1. 河南工学院 计算机科学与技术系,河南 新乡 453003
    2. 东北大学 计算机科学与工程学院,沈阳 110819
  • 出版日期:2018-01-01 发布日期:2018-01-09

Methods for Similarity Query on Uncertain Data with Cosine Similarity Constraints

ZHU Mingdong1,2+, XU Lixin1, SHEN Derong2, KOU Yue2, NIE Tiezheng2   

  1. 1. Department of Computer Science and Technology, Henan Institute of Technology, Xinxiang, Henan 453003, China
    2. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
  • Online:2018-01-01 Published:2018-01-09

摘要: 最近邻查询在多个领域具有广泛的应用,如组合过滤、基于位置的服务、决策支持系统等。而且随着Web信息实体抽取、隐私保护信息转化、图像识别等技术的发展和普及,在诸多领域,不确定性文本数据普遍存在,基于信息论的TF-IDF算法,可以将文本型的相似匹配转化为数值型的向量的计算,具有严密性和有效性。但TF-IDF信息的余弦距离不属于度量空间,难于构建索引。为此主要研究了面向不确定文本数据基于余弦相似度的相似性查询方法。通过分析不确定性余弦相似度计算的特性,提出了快速相似度计算方法。通过对余弦距离的计算进行转换,构建改进的索引结构sMVP-tree(statistic multiple vantage point tree),并给出了基于余弦相似度面向不确定性数据的相似度计算方法。最后,结合该相似度计算方法提出了分布式环境下[kNN]查询和[RkNN]查询算法。大量的基于真实数据的实验验证了算法的正确性和有效性。

关键词: 不确定数据, 分布式算法, 余弦相似度, 相似性查询

Abstract: Nearest neighbor queries have been used in a wide variety of applications such as collaborative filtering, location-based services and decision support systems. Meanwhile, with the development of entity extraction in Web information, information transformation in privacy protection, text recognition in images, in many fields, uncertain text information is ubiquitous. In the field of information theory, the calculation of textual similarity is transformed to the computation of vector similarity by TF-IDF algorithm, which is rigorous and efficient. However, cosine distance based on TF-IDF does not belong to metric distance function, and it is difficult to build indices on cosine similarity. To this end, this paper studies methods for nearest neighbor queries on uncertain data with cosine similarity constraints. Existing methods are efficient either for numerical data or for certain data, but there is no method that can efficiently support uncertain and character data. So this paper first analyzes the property of cosine similarity to boost up similarity computation. Secondly, this paper proposes an efficient method for similarity queries on uncertain data by transforming cosine similarity computation, and designs an improved tree index for metric space, sMVP-tree (statistic multiple vantage point tree). Lastly, this paper extends the framework to a distributed environment and presents kNN query and RkNN algorithms. The experimental results show that the proposed method is effective and efficient.

Key words: uncertain data, distributed algorithm, cosine similarity, similarity query