计算机科学与探索 ›› 2014, Vol. 8 ›› Issue (12): 1422-1431.DOI: 10.3778/j.issn.1673-9418.1405053

• 学术研究 • 上一篇    下一篇

基于随机游走路径的分布式SimRank算法

刘  恒,寇  月+,申德荣,王泰明,于  戈   

  1. 东北大学 信息科学与工程学院,沈阳 110004
  • 出版日期:2014-12-01 发布日期:2014-12-08

Distributed SimRank Algorithm Based on Random Walk Path

LIU Heng, KOU Yue+, SHEN Derong, WANG Taiming, YU Ge   

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110004, China
  • Online:2014-12-01 Published:2014-12-08

摘要: SimRank算法是一种常用的相似性度量模型,它基于图的拓扑结构信息来衡量任意两个对象之间的相似程度。随着数据规模的不断增大,集中式SimRank算法已不适用,而已有的分布式SimRank算法在运行效率和扩展性等方面存在缺陷。针对上述问题,提出了一种两阶段的基于随机游走路径的分布式SimRank算法。第一阶段基于BSP(bulk synchronous parallel)模型建立随机游走路径索引信息,支持新路径的动态添加,并通过阈值过滤尽可能减少生成路径的数量;第二阶段利用第一阶段生成的索引信息,提出了基于MapReduce的分布式SimRank算法。最后,通过实验验证了算法的可行性和有效性。

关键词: 分布式SimRank, 随机游走路径, BSP模型, MapReduce

Abstract: SimRank is a widely used model for computing similarity, it measures similarity between objects based on graph topology. With the rapid increase of data, the way of centralized SimRank is not applicable and current distributed SimRank approaches have some drawbacks in efficiency and scalability. This paper presents a two-stage distributed SimRank algorithm based on random walk path. The first stage is data preprocessing and a Find-K-Paths algorithm based on BSP (bulk synchronous parallel) model is proposed. The algorithm can effectively build the index information of random walk path and support the dynamic adding of new paths. The number of the generated paths can be reduced by threshold filtering. In the second stage, based on the index information, a distributed SimRank algorithm is proposed under MapReduce. The experiments demonstrate the feasibility and effectiveness of the proposed algorithm.

Key words: distributed SimRank, random walk path, BSP model, MapReduce