Big Data Similarity Join Processing Based on Prefix-Suffix Filtering

doi:10.3778/j.issn.1673-9418.1608045

Journal of Frontiers of Computer Science and Technology, 2017, Vol. 11, Issue (8): 1235-1245. DOI: 10.3778/j.issn.1673-9418.1608045

Big Data Similarity Join Processing Based on Prefix-Suffix Filtering

DENG Shizhuo, XIN Junchang, NIE Tiezheng, WANG Guoren+

School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China

Online:2017-08-01 Published:2017-08-09

Abstract

Similarity join is a key technique in entity identification and data integration, and an important means of mining valuable information from data. As data volumes grow, traditional centralized similarity join can no longer meet the timeliness requirements of data processing, whereas distributed computing can improve its execution efficiency. This paper therefore studies distributed similarity join processing algorithms based on Spark. To address the shortcomings of filtering methods that use only suffix positional information, it proposes PSJoin, a distributed similarity join that filters using the positional information of the tokens shared between the prefix of one record and the suffix of another, which improves processing efficiency and reduces execution time. For the filtering problem in weighted similarity join, it applies the same prefix-suffix filtering principle and proposes WTPSJoin, a distributed weighted similarity join that compares the weight of the first token after the common tokens in one record's prefix with the weights of the tokens in the other record's suffix. Together, the two algorithms offer two reliable solutions for similarity join over big data. Experiments on mixed datasets drawn from multiple data sources show that the proposed algorithms filter more effectively and run in less time than existing filtering algorithms, and that they achieve good speedup.

Keywords: similarity join, weighted similarity join, big data, filtering, Spark
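The abstract builds on prefix-based and position-based filtering, so a small example of those underlying ideas may help. The sketch below is not PSJoin or WTPSJoin, whose details are not given on this page. It is a single-machine illustration of the classic prefix filter plus a positional upper bound for Jaccard similarity, with no Spark involved. All function names, the threshold value, and the sample records are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import ceil

EPS = 1e-9  # guards ceil() against floating-point noise such as 3.0000000000000004


def jaccard(a, b):
    """Exact Jaccard similarity of two token lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def canonicalize(records):
    """Deduplicate tokens and sort each record by one global order (rare tokens first)."""
    freq = Counter(tok for rec in records for tok in set(rec))
    return [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]


def prefix_length(size, threshold):
    """Records with Jaccard >= threshold must share a token within these prefixes."""
    return size - ceil(threshold * size - EPS) + 1


def required_overlap(size_x, size_y, threshold):
    """Minimum number of shared tokens implied by the Jaccard threshold."""
    return ceil(threshold / (1 + threshold) * (size_x + size_y) - EPS)


def passes_filters(x, y, threshold):
    """Prefix filter followed by a positional upper bound on the overlap."""
    px = x[:prefix_length(len(x), threshold)]
    py = y[:prefix_length(len(y), threshold)]
    pos_y = {tok: j for j, tok in enumerate(py)}
    for i, tok in enumerate(px):
        j = pos_y.get(tok)
        if j is not None:
            # tok is the first shared token, so the overlap is at most
            # 1 plus the number of tokens remaining after it in either record.
            upper = 1 + min(len(x) - i - 1, len(y) - j - 1)
            return upper >= required_overlap(len(x), len(y), threshold)
    return False  # no shared prefix token: the pair cannot reach the threshold


def similarity_join(records, threshold):
    """Return index pairs (i, j), i < j, whose Jaccard similarity meets the threshold."""
    canon = canonicalize(records)
    result = []
    for i, j in combinations(range(len(canon)), 2):
        if passes_filters(canon[i], canon[j], threshold):
            if jaccard(canon[i], canon[j]) >= threshold:  # verification step
                result.append((i, j))
    return result
```

Both filters only discard pairs that cannot reach the threshold, so the verified result should match a brute-force comparison of all pairs. A distributed version would typically group records by prefix token so that only records sharing a prefix token meet on the same worker, but that part is not sketched here.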

DENG Shizhuo, XIN Junchang, NIE Tiezheng, WANG Guoren. Big Data Similarity Join Processing Based on Prefix-Suffix Filtering[J]. Journal of Frontiers of Computer Science and Technology, 2017, 11(8): 1235-1245.
