Distributed SimRank Algorithm Based on Random Walk Path

doi:10.3778/j.issn.1673-9418.1405053

Journal of Frontiers of Computer Science and Technology, 2014, Vol. 8, Issue 12: 1422-1431. DOI: 10.3778/j.issn.1673-9418.1405053

LIU Heng, KOU Yue+, SHEN Derong, WANG Taiming, YU Ge

College of Information Science and Engineering, Northeastern University, Shenyang 110004, China

Online: 2014-12-01 Published: 2014-12-08

Abstract: SimRank is a widely used similarity model that measures the similarity between any two objects based on the topology of a graph. As data volumes grow, centralized SimRank is no longer applicable, and existing distributed SimRank approaches have drawbacks in efficiency and scalability. To address these problems, this paper presents a two-stage distributed SimRank algorithm based on random walk paths. The first stage is data preprocessing: a Find-K-Paths algorithm based on the BSP (bulk synchronous parallel) model builds the index of random walk paths, supports the dynamic addition of new paths, and uses threshold filtering to keep the number of generated paths as small as possible. The second stage uses the index built in the first stage to compute SimRank in a distributed manner under MapReduce. Experiments demonstrate the feasibility and effectiveness of the proposed algorithm.

Key words: distributed SimRank, random walk path, BSP model, MapReduce
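
The abstract does not spell out Find-K-Paths or the MapReduce stage, so the following is only a minimal single-machine sketch of the random-walk view of SimRank that a path index of this kind builds on. It uses the standard Monte Carlo first-meeting estimator, not the paper's algorithm: the similarity of two nodes is approximated by the average of `decay ** t`, where `t` is the first step at which two reverse random walks started from those nodes meet. All names here (`in_neighbors`, `decay`, `length`, `num_walks`) are illustrative assumptions.

```python
import random


def sample_reverse_walk(in_neighbors, start, length, rng):
    """Sample one reverse random walk of at most `length` steps.

    At each step the walk moves to a uniformly chosen in-neighbor and
    stops early if the current node has no in-neighbors.
    """
    path = [start]
    node = start
    for _ in range(length):
        preds = in_neighbors.get(node, [])
        if not preds:
            break  # dead end: nothing points to this node
        node = rng.choice(preds)
        path.append(node)
    return path


def estimate_simrank(in_neighbors, a, b, decay=0.8, length=10,
                     num_walks=2000, seed=0):
    """Estimate SimRank(a, b) from pairs of sampled reverse walks.

    Each pair contributes decay ** t, where t is the first step at which
    the two walks occupy the same node, or 0 if they never meet.
    """
    if a == b:
        return 1.0  # a node is maximally similar to itself
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_walks):
        path_a = sample_reverse_walk(in_neighbors, a, length, rng)
        path_b = sample_reverse_walk(in_neighbors, b, length, rng)
        for step, (x, y) in enumerate(zip(path_a, path_b)):
            if step > 0 and x == y:
                total += decay ** step
                break
    return total / num_walks


# Hypothetical toy graph: both "a" and "b" are pointed to only by "c".
toy_graph = {"a": ["c"], "b": ["c"], "c": []}
toy_score = estimate_simrank(toy_graph, "a", "b")
```

In the toy graph every pair of walks meets at `c` after one step, so the estimate coincides with the exact SimRank value, `decay`. A distributed variant would generate and store the walks once (the role the abstract assigns to the first stage) and reuse them across node pairs rather than resampling per query.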

LIU Heng, KOU Yue, SHEN Derong, WANG Taiming, YU Ge. Distributed SimRank Algorithm Based on Random Walk Path[J]. Journal of Frontiers of Computer Science and Technology, 2014, 8(12): 1422-1431.
