Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (4): 417-426.DOI: 10.3778/j.issn.1673-9418.1311018


Code Clone Detection Method for Large-Scale Source Code

GUO Ying1,2, CHEN Fenghong1,2, ZHOU Minghui1,2+   

  1. Institute of Software, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
    2. Key Laboratory of High Confidence Software Technologies of Ministry of Education, Peking University, Beijing 100871, China
  • Online: 2014-04-01  Published: 2014-04-03

Abstract: Detecting code clones benefits plagiarism detection, copyright-infringement investigation, software evolution analysis, code compaction, error detection, and the discovery of reuse patterns, among others. Existing clone detection tools either use complicated algorithms or consume large amounts of computing resources, so they cannot be applied to large-scale code data. To detect code clones on massive data, this paper proposes a new code clone detection algorithm. The algorithm combines the idea of content-defined chunking (CDC) from data de-duplication with that of the Simhash algorithm from duplicate web page detection, first chunking the code and then performing fuzzy matching. The algorithm is implemented on a data source of more than 500 million files, about 10 TB in total, drawn from a variety of open source projects. This paper compares the influence of different chunk lengths on the detection rate and detection time. The experimental results show that the new algorithm not only scales to large code bases but also detects some Type 3 clones, with high detection precision.

Key words: code clone, detection, large-scale code data
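The chunk-then-fuzzy-match idea described in the abstract can be sketched in a few lines. This is a minimal illustrative toy, not the authors' implementation: the rolling-window width, chunk mask, 64-bit fingerprint, and MD5-based feature hashing are all assumptions made for the sketch.

```python
# Toy combination of content-defined chunking (CDC) and Simhash fingerprints.
# All parameters below are assumptions for illustration, not the paper's values.
import hashlib

WINDOW = 16        # rolling-window width and minimum chunk size (assumed)
CHUNK_MASK = 0x3F  # boundary pattern; roughly controls average chunk length (assumed)

def cdc_chunks(data: bytes):
    """Split data at content-defined boundaries using a simple rolling sum,
    so identical content produces identical chunks regardless of position."""
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling = (rolling + b) & 0xFFFFFFFF
        if i >= WINDOW:
            rolling = (rolling - data[i - WINDOW]) & 0xFFFFFFFF
        # Declare a boundary when the low bits of the rolling sum hit a fixed
        # pattern and the chunk has reached the minimum size.
        if i - start + 1 >= WINDOW and (rolling & CHUNK_MASK) == CHUNK_MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def simhash(features, bits=64):
    """Standard Simhash: each feature's hash casts a +1/-1 vote per bit;
    the fingerprint keeps the bits with a positive total."""
    votes = [0] * bits
    for f in features:
        h = int.from_bytes(hashlib.md5(f).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing fingerprint bits; files sharing most chunks share
    most Simhash votes, so clones tend to lie close in Hamming distance."""
    return bin(a ^ b).count("1")

f1 = b"int add(int a, int b) { return a + b; } // adds two integers"
f2 = b"int add(int a, int b) { return a + b; } // add the two ints"
sig1 = simhash(cdc_chunks(f1))
sig2 = simhash(cdc_chunks(f2))
print(hamming(sig1, sig2))
```

Chunk-level fingerprinting is what makes the comparison fuzzy: a small local edit changes only the chunks it touches, leaving the rest of the Simhash votes intact, which is how some Type 3 clones (with inserted or deleted statements) can still be matched.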
