Journal of Frontiers of Computer Science and Technology, 2020, Vol. 14, Issue (10): 1656-1669. DOI: 10.3778/j.issn.1673-9418.1910019

• Academic Research •

Code Clone Detection Based on Program Vector Tree

ZENG Jie, BEN Kerong, ZHANG Xian, LI Xiaowei, ZHOU Quan

  1. College of Electronic Engineering, Navy University of Engineering, Wuhan 430033, China
  2. Jinghang Research Institute of Computing and Communication, Beijing 100074, China
  3. School of Computer Science, Wuhan University, Wuhan 430072, China
  • Online: 2020-10-01 Published: 2020-10-12

Abstract:

Code cloning can speed up software development, but it also causes recurring defects and software quality problems. Some types of code clones have low textual similarity, which makes them hard to detect. To address this problem, this paper proposes a code clone detection method based on the program vector tree. First, feature representations of lexical units are extracted with a statistical language model, and the semantic similarity between different literal tokens is analyzed. Second, the abstract syntax tree (AST) of each program is extracted by syntactic analysis, and each leaf node is assigned the feature representation of its corresponding literal token, which transforms the AST into a program vector tree. Finally, a weighted encoding rule is proposed that encodes each program vector tree into a fixed-length vector while distinguishing the importance of different tree nodes; code fragments with similar vector representations are reported as code clones. Experimental results on BigCloneBench, a large standard benchmark of real code clones, show that the method outperforms prominent clone detection methods, including NiCad, Deckard, SourcererCC and Oreo, in detecting Moderately Type-3 and Type-4 clones, which have low textual similarity. This confirms the effectiveness of the method.

Key words: code clone, code clone detection, abstract syntax tree (AST), program vector tree
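
The three steps described in the abstract (token vectors, AST leaves carrying those vectors, weighted encoding into a fixed-length vector) can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: it parses Python with the standard `ast` module purely for convenience, `token_vector` is a hash-based stand-in for embeddings learned by a statistical language model (so it carries no semantic similarity between different tokens), and the depth-decay weight `decay ** depth` is a hypothetical weighting, since the abstract does not specify the actual encoding rule.

```python
import ast
import hashlib
import math

DIM = 16  # hypothetical embedding size, chosen only for this sketch


def token_vector(token):
    """Stand-in for a learned token embedding (assumption): a deterministic
    unit vector derived from a hash. It encodes no semantics."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    raw = [digest[i] / 255.0 - 0.5 for i in range(DIM)]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def leaves(node, depth=0):
    """Yield (token, depth) for every leaf of the AST rooted at node."""
    if isinstance(node, ast.Name):
        yield node.id, depth
    elif isinstance(node, ast.arg):
        yield node.arg, depth
    elif isinstance(node, ast.Constant):
        yield repr(node.value), depth
    else:
        # Load/Store context markers carry no token, so they are skipped.
        children = [c for c in ast.iter_child_nodes(node)
                    if not isinstance(c, ast.expr_context)]
        if not children:
            yield type(node).__name__, depth  # e.g. Add, Pass
        for child in children:
            yield from leaves(child, depth + 1)


def encode(source, decay=0.8):
    """Encode source code as a fixed-length vector: a weighted mean of leaf
    vectors, where deeper leaves weigh less (hypothetical weighting)."""
    total = [0.0] * DIM
    weight_sum = 0.0
    for token, depth in leaves(ast.parse(source)):
        weight = decay ** depth
        vector = token_vector(token)
        for i in range(DIM):
            total[i] += weight * vector[i]
        weight_sum += weight
    return [x / weight_sum for x in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In this sketch, two fragments would be reported as a clone pair when `cosine(encode(a), encode(b))` exceeds some threshold, which would have to be tuned on labeled data. Fragments that differ only in layout or comments produce identical vectors because they share one AST. Detecting renamed or restructured clones depends on token vectors that place related tokens close together, which the hash-based stand-in does not provide.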