Journal of Frontiers of Computer Science and Technology ›› 2017, Vol. 11 ›› Issue (8): 1235-1245.DOI: 10.3778/j.issn.1673-9418.1608045

Previous Articles     Next Articles

Big Data Similarity Join Processing Based on Prefix-Suffix Filtering

DENG Shizhuo, XIN Junchang, NIE Tiezheng, WANG Guoren+   

  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
  • Online:2017-08-01 Published:2017-08-09

双缀过滤的大数据相似性连接处理算法

邓诗卓信俊昌聂铁铮王国仁+   

  1. 东北大学 计算机科学与工程学院,沈阳 110819

Abstract: Similarity join is one of the key techniques in entity identification and data integration which are significant for detecting valuable information. With the development of big data, it cannot satisfy the demand of efficiency to do the job on one machine. As a consequence, distributed computation becomes a better choice to improve the    execution efficiency of similarity join. This paper gets a deeper understanding of processing algorithms for distributed similarity join based on Spark. Since the method using only suffix positional information for filtering has some shortcomings, this paper proposes a distributed similarity join processing method PSJoin, which uses the common token positional information between the prefix of one record and the suffix of another one. Also PSJoin can be      applied to the weighted case with a little change for weight tokens. It compares the weight of the first token in the mix-suffix of one record with the weights of the tokens in the other record. The weighted similarity join with PSJoin is called WTPSJoin which improves the processing efficiency. The two new methods provide novel and efficient    solutions for similarity join of big data. Experiments are tested on mixed datasets, and results show that the proposed algorithms have better performance in filtering, less running time cost and perfect speedup.

Key words: similarity join, weighted similarity join, big data, filtering, Spark

摘要: 相似性连接技术是实体识别和数据集成的关键技术之一,是挖掘数据中有价值信息的重要手段。随着大数据发展,传统的集中式相似性连接已经无法满足人们对数据处理的时效性需求,并且利用分布式计算可以提高相似性连接的执行效率。因此,深入研究了基于Spark的分布式相似性连接处理算法。针对仅使用后缀位置信息过滤方法的不足,提出了利用一条记录前缀与另一条记录后缀间共同元素位置信息来进行过滤的分布式相似性连接PSJoin,提高了相似性连接的处理效率,减少了相似性连接的执行时间。同时,针对基于权重的相似度连接算法的过滤问题,结合双缀过滤原理,通过一条记录前缀共同元素之后的第一个元素的权重与另一条记录后缀中元素权重大小的关系,提出了基于双缀过滤的分布式权重相似性连接WTPSJoin。为面向大数据的相似性连接计算提供了两种可靠的解决方案。两种算法在多数据源混合数据集上进行测试实验,实验结果表明,所提算法相对于已有的过滤算法过滤效果好,执行时间少,同时具有良好的加速比。

关键词: 相似性连接, 权重相似性连接, 大数据, 过滤, Spark