Journal of Frontiers of Computer Science and Technology ›› 2011, Vol. 5 ›› Issue (1): 38-49. DOI: 10.3778/j.issn.1673-9418.2011.01.004

• Academic Research •

Highly Efficient Two-Round Algorithm for Fast Remote File Synchronization

XU Dan¹⁺, SHENG Yonghong², JU Dapeng², WU Jianping¹, WANG Dongsheng²,³

  

  1. School of Computer Science and Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  3. National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-01-01 Published: 2011-01-01
  • Corresponding author (+): XU Dan

Abstract: Fast remote file synchronization is widely used in file backup and recovery, Web and FTP site mirroring, content distribution networks, and Web access. This paper presents tpsync, an efficient two-round fast file synchronization algorithm that combines content-based variable-sized chunking with fixed-sized sliding blocks. In the first round, tpsync uses content-based variable-sized chunking to locate, at coarse granularity, the data segments that have changed locally between similar files. In the second round, it applies fixed-sized sliding blocks to those changed segments to find the differing data at fine granularity, so that the files are synchronized after two rounds of data exchange. tpsync is compared experimentally with rsync, the traditional single-round synchronization algorithm. Experiments synchronizing similar versions of text, binary, and database files show that tpsync outperforms rsync in both average synchronization time and the amount of data transferred over the network.

Keywords: duplicate data detection, file synchronization, rsync
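
The two rounds described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation: the boundary test (`zlib.adler32` over a fixed window), the MD5 strong hash, the parameters `WINDOW`, `MASK`, and `BLOCK`, and the function names `cdc_chunks`, `block_table`, `fine_delta`, `make_delta`, and `apply_delta` are all assumptions chosen for brevity. It also assumes that only old-file chunks left unmatched in round one are block-hashed for round two, and it simulates both sides of the exchange in a single process.

```python
import hashlib
import zlib

# Illustrative parameters (assumptions, not values from the paper).
WINDOW = 16   # bytes hashed to decide a chunk boundary
MASK = 0x1F   # cut a chunk when window_hash & MASK == 0
BLOCK = 8     # fixed block size for the fine-grained round


def strong(data):
    """Strong hash used to confirm that two pieces of data are identical."""
    return hashlib.md5(data).digest()


def cdc_chunks(data):
    """Content-defined chunking: cut where the hash of the last WINDOW bytes hits a pattern."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if end - start >= WINDOW and zlib.adler32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def block_table(old, regions):
    """Index fixed-size blocks of the given (offset, length) regions by weak hash."""
    table = {}
    for off, n in regions:
        for b in range(off, off + n - BLOCK + 1, BLOCK):
            blk = old[b:b + BLOCK]
            table.setdefault(zlib.adler32(blk), []).append((strong(blk), b))
    return table


def fine_delta(segment, table):
    """Round 2: slide a BLOCK-byte window over a changed segment, byte by byte."""
    ops, lit, i = [], bytearray(), 0
    while i < len(segment):
        win = segment[i:i + BLOCK]
        hit = None
        if len(win) == BLOCK:
            for s, off in table.get(zlib.adler32(win), ()):
                if s == strong(win):      # weak hash matched, confirm with strong hash
                    hit = off
                    break
        if hit is None:
            lit.append(segment[i])        # no match: keep one literal byte, slide by one
            i += 1
        else:
            if lit:
                ops.append(("data", bytes(lit)))
                lit.clear()
            ops.append(("copy", hit, BLOCK))
            i += BLOCK
    if lit:
        ops.append(("data", bytes(lit)))
    return ops


def make_delta(old, new):
    """Round 1 matches whole chunks; unmatched runs are refined by round 2."""
    index, pos = {}, 0
    for c in cdc_chunks(old):
        index.setdefault(strong(c), (pos, len(c)))
        pos += len(c)
    new_chunks = cdc_chunks(new)
    new_hashes = {strong(c) for c in new_chunks}
    # Only old chunks that do not appear in the new file are block-hashed.
    table = block_table(old, [r for h, r in index.items() if h not in new_hashes])
    delta, pending = [], b""
    for c in new_chunks:
        hit = index.get(strong(c))
        if hit is None:
            pending += c                  # merge adjacent changed chunks into one segment
            continue
        if pending:
            delta += fine_delta(pending, table)
            pending = b""
        delta.append(("copy", *hit))
    if pending:
        delta += fine_delta(pending, table)
    return delta


def apply_delta(old, delta):
    """Rebuild the new file from the old file plus the delta operations."""
    out = bytearray()
    for op in delta:
        out += old[op[1]:op[1] + op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


old = b"The quick brown fox jumps over the lazy dog. " * 40
new = old[:700] + b"[edited] " + old[700:]
delta = make_delta(old, new)
assert apply_delta(old, delta) == new
literal = sum(len(op[1]) for op in delta if op[0] == "data")
print(f"literal bytes sent: {literal} of {len(new)}")
```

In this sketch a `("copy", offset, length)` operation stands for data the receiver already holds, and a `("data", bytes)` operation stands for literal bytes that would cross the network. How much the second round saves depends on the chunk and block sizes, which are arbitrary here.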
