Code Clone Detection Method for Large-Scale Source Code

Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (4): 417-426. DOI: 10.3778/j.issn.1673-9418.1311018

GUO Ying1,2, CHEN Fenghong1,2, ZHOU Minghui1,2+

1. Institute of Software, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
2. Key Laboratory of High Confidence Software Technologies of Ministry of Education, Peking University, Beijing 100871, China

Online: 2014-04-01 Published: 2014-04-03

Abstract

Abstract: Code clone detection plays an important role in plagiarism detection, copyright infringement investigation, software evolution analysis, code compaction, error detection and bug finding, and the discovery of reuse patterns. Existing clone detection tools either rely on complex algorithms or consume large amounts of computing resources, so they are not suitable for very large bodies of code. To detect code clones at large scale, this paper proposes a new clone detection algorithm. The algorithm combines the idea of content-defined chunking (CDC) from data de-duplication with the idea of the Simhash algorithm from duplicate web page detection: code is first split into chunks, and the chunks are then matched fuzzily. The algorithm is implemented on a data source drawn from a variety of open source projects, containing more than 500 million code files and about 10 TB of code. Experiments compare the effect of different chunk lengths on the clone detection rate and on detection time. The results show that the new algorithm can be applied to large-scale code clone detection and can also detect some Type 3 clones, with high precision.

Key words: code clone, detection, large-scale code data
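
The abstract describes the method only at a high level: chunk first, then match fuzzily. The Python sketch below illustrates that general idea and is not the authors' implementation. Everything concrete in it is an assumption made for illustration: chunk boundaries are chosen from a hash of the last `window` lines, MD5 is used only as a stable 64-bit hash, and the names and parameters (`chunk_lines`, `divisor`, `min_lines`, `max_lines`, `max_distance`) are hypothetical.

```python
import hashlib
import re
from collections import deque


def _h64(text):
    """Stable 64-bit hash of a string (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


def chunk_lines(lines, window=2, divisor=4, min_lines=2, max_lines=40):
    """Content-defined chunking at line granularity (illustrative only).

    A chunk ends after a line when the hash of the last `window`
    normalized lines is 0 modulo `divisor`, so boundaries follow local
    content rather than fixed offsets. `min_lines` and `max_lines`
    bound the chunk length.
    """
    chunks, current = [], []
    recent = deque(maxlen=window)
    for line in lines:
        norm = " ".join(line.split())  # collapse whitespace
        if not norm:
            continue  # skip blank lines
        current.append(norm)
        recent.append(norm)
        at_boundary = _h64("\n".join(recent)) % divisor == 0
        if (at_boundary and len(current) >= min_lines) or len(current) >= max_lines:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


def tokenize(chunk):
    """Split a chunk into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", " ".join(chunk))


def simhash(tokens, bits=64):
    """Simhash fingerprint: each token votes +1 or -1 on every bit."""
    weights = [0] * bits
    for tok in tokens:
        h = _h64(tok)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i, w in enumerate(weights):
        if w > 0:
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def find_clones(lines_a, lines_b, max_distance=3):
    """Return (i, j, distance) for chunk pairs with close fingerprints."""
    fps_a = [simhash(tokenize(c)) for c in chunk_lines(lines_a)]
    fps_b = [simhash(tokenize(c)) for c in chunk_lines(lines_b)]
    return [
        (i, j, hamming(fa, fb))
        for i, fa in enumerate(fps_a)
        for j, fb in enumerate(fps_b)
        if hamming(fa, fb) <= max_distance
    ]
```

Because similar token sets tend to produce fingerprints that differ in only a few bits, a small Hamming threshold can tolerate minor edits inside a chunk, which is one plausible way such a scheme could catch some Type 3 clones. The all-pairs comparison in `find_clones` is for clarity only; at the scale the abstract mentions, an index over fingerprints would be needed instead.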

GUO Ying, CHEN Fenghong, ZHOU Minghui. Code Clone Detection Method for Large-Scale Source Code[J]. Journal of Frontiers of Computer Science and Technology, 2014, 8(4): 417-426.


