Methods for Similarity Query on Uncertain Data with Cosine Similarity Constraints

doi:10.3778/j.issn.1673-9418.1610044

Abstract

Nearest neighbor queries are used in a wide variety of applications, such as collaborative filtering, location-based services, and decision support systems. Meanwhile, with the development of Web entity extraction, information transformation for privacy protection, and text recognition in images, uncertain text information has become ubiquitous in many fields. The TF-IDF algorithm, which is grounded in information theory, transforms the calculation of textual similarity into the computation of vector similarity in a rigorous and efficient way. However, the cosine distance based on TF-IDF is not a metric distance function, so it is difficult to build indices on cosine similarity. To this end, this paper studies methods for nearest neighbor queries on uncertain data with cosine similarity constraints. Existing methods are efficient either for numerical data or for certain data, but no existing method efficiently supports data that is both uncertain and textual. This paper first analyzes the properties of cosine similarity in order to speed up similarity computation. Second, it proposes an efficient method for similarity queries on uncertain data by transforming the cosine similarity computation, and designs an improved tree index for metric spaces, the sMVP-tree (statistic multiple vantage point tree). Lastly, it extends the framework to a distributed environment and presents kNN and RkNN query algorithms. The experimental results show that the proposed method is effective and efficient.
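The statement that cosine distance is not a metric can be checked with a small numeric sketch. The abstract does not say which transformation the paper applies, so the two alternatives below (the angle between vectors, and the Euclidean distance between length-normalized vectors) are only common examples of metric-compatible rewrites of cosine similarity, not necessarily the authors' method. All function names and the sample vectors are illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    # 1 - cos(u, v): widely used, but it can violate the triangle inequality.
    return 1.0 - cosine_similarity(u, v)

def angular_distance(u, v):
    # Angle between u and v in radians; clamp to guard against rounding error.
    c = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(c)

def chord_distance(u, v):
    # Euclidean distance between the unit-length versions of u and v,
    # which equals sqrt(2 - 2 * cos(u, v)).
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine_similarity(u, v)))

# Hypothetical sample vectors: y lies halfway (in angle) between x and z.
x, y, z = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

# Going through y is "shorter" than going directly, so the triangle
# inequality fails for cosine distance on this triple.
violates = cosine_distance(x, y) + cosine_distance(y, z) < cosine_distance(x, z)

# The two rewritten distances satisfy the triangle inequality on this triple.
eps = 1e-9
angular_ok = angular_distance(x, y) + angular_distance(y, z) >= angular_distance(x, z) - eps
chord_ok = chord_distance(x, y) + chord_distance(y, z) >= chord_distance(x, z) - eps

print(violates, angular_ok, chord_ok)
```

Both rewritten distances are monotonically decreasing in cosine similarity, so they produce the same nearest neighbor ranking as cosine similarity itself. That is why a rewrite of this kind lets a metric-space index be used without changing the query results.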

Keywords: uncertain data, distributed algorithm, cosine similarity, similarity query

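Abstract (Chinese version): Nearest neighbor queries are widely used in many fields, such as collaborative filtering, location-based services, and decision support systems. With the development and spread of technologies such as Web entity extraction, privacy-preserving information transformation, and image recognition, uncertain text data has become common in many fields. The TF-IDF algorithm, which is grounded in information theory, converts textual similarity matching into computation on numerical vectors, and is both rigorous and effective. However, the cosine distance over TF-IDF information does not form a metric space, which makes it difficult to build indices. This paper therefore studies cosine-similarity-based similarity query methods for uncertain text data. By analyzing the properties of cosine similarity computation under uncertainty, it proposes a fast similarity computation method. By transforming the cosine distance computation, it builds an improved index structure, the sMVP-tree (statistic multiple vantage point tree), and gives a cosine-similarity-based similarity computation method for uncertain data. Finally, building on this similarity computation method, it proposes kNN and RkNN query algorithms for distributed environments. Extensive experiments on real data verify the correctness and effectiveness of the algorithms.

Keywords (Chinese version): uncertain data, distributed algorithm, cosine similarity, similarity query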

ZHU Mingdong, XU Lixin, SHEN Derong, KOU Yue, NIE Tiezheng. Methods for Similarity Query on Uncertain Data with Cosine Similarity Constraints[J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(1): 49-64.

朱命冬，徐立新，申德荣，寇月，聂铁铮. 面向不确定文本数据的余弦相似性查询方法[J]. 计算机科学与探索, 2018, 12(1): 49-64.

