Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (3): 456-467. DOI: 10.3778/j.issn.1673-9418.2005048

• Science Researches •

Method of Code Features Automated Extraction

SHI Zhicheng, ZHOU Yu   

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  2. Key Laboratory for Safety-Critical Software Development and Verification, Ministry of Industry and Information Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Online: 2021-03-01 Published: 2021-03-05

Abstract:

Applying neural networks to software engineering has greatly reduced the burden of extracting code features by hand. Existing feature extraction models usually either treat code as natural language or depend heavily on expert domain knowledge. Treating code as natural language is too simplistic and easily loses information, while models built on expert-designed heuristic rules tend to be overly complex and lack extensibility and generality. To address these problems, this paper proposes a model based on convolutional and recurrent neural networks that extracts code features from the abstract syntax tree (AST). To mitigate the vanishing gradient problem caused by very large ASTs, the AST is split into a sequence of small ASTs, which is then fed into the model. The model uses a convolutional neural network to extract structural information and a bidirectional recurrent neural network to extract sequential information. The whole procedure requires no expert domain knowledge to guide training: given code labeled with its class, the model automatically learns how to extract features. The trained classification encoder is evaluated on a similar code search task, where it achieves Top1, NDCG and MRR scores of 0.560, 0.679 and 0.638, respectively. Compared with recent state-of-the-art deep learning models for code feature extraction and with commonly used similar code detection tools, the proposed model shows significant advantages.
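
The idea of splitting one large AST into a sequence of small ASTs can be illustrated with a minimal sketch. It assumes statement-level granularity and uses Python's standard `ast` module purely for convenience; the abstract does not state how the AST is cut or which programming language the paper targets, and the names `split_ast`, `count_nodes`, `BLOCK_FIELDS` and `SAMPLE` are hypothetical. Each compound statement is reduced to its header, and the statements nested inside it become separate pieces in source order.

```python
import ast
import copy

# Fields that hold nested blocks in Python's ast nodes, listed in source order.
BLOCK_FIELDS = ("body", "handlers", "orelse", "finalbody")

# Hypothetical sample input, used only for illustration.
SAMPLE = """
def positive_sum(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += x
    return total
"""


def split_ast(source):
    """Split the AST of `source` into a pre-order sequence of small statement trees.

    Assumption: statement-level granularity. A compound statement keeps only
    its header (nested blocks are emptied); its children become later pieces.
    """
    pieces = []

    def visit(nodes):
        for node in nodes:
            header = copy.copy(node)  # shallow copy, the original tree stays intact
            for field in BLOCK_FIELDS:
                if hasattr(header, field):
                    setattr(header, field, [])
            pieces.append(header)
            # Recurse into the nested blocks of the original node.
            for field in BLOCK_FIELDS:
                visit(getattr(node, field, []))

    visit(ast.parse(source).body)
    return pieces


def count_nodes(tree):
    """Number of AST nodes reachable from `tree`, including `tree` itself."""
    return sum(1 for _ in ast.walk(tree))


if __name__ == "__main__":
    for piece in split_ast(SAMPLE):
        print(type(piece).__name__, count_nodes(piece))
```

Every piece produced this way is much smaller than the full tree, which is the property the abstract relies on to ease gradient vanishing; how each piece is then encoded by the convolutional and recurrent layers is outside the scope of this sketch.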

Key words: code feature extraction, code classification, program comprehension, similar code search
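
For reference, the three metrics reported in the abstract can be computed as sketched below under common definitions with binary relevance, where a retrieved snippet counts as relevant when it carries the same class label as the query. The paper's exact formulation, ranking cutoff and relevance grading are not given in the abstract, so these definitions and the function names `top1`, `mrr` and `ndcg` are assumptions.

```python
import math


def top1(rankings):
    """Fraction of queries whose first retrieved item is relevant."""
    return sum(1 for ranking in rankings if ranking[0]) / len(rankings)


def mrr(rankings):
    """Mean reciprocal rank of the first relevant item (0 if none is relevant)."""
    total = 0.0
    for ranking in rankings:
        for position, relevant in enumerate(ranking, start=1):
            if relevant:
                total += 1.0 / position
                break
    return total / len(rankings)


def ndcg(rankings):
    """Mean normalized discounted cumulative gain with a log2 discount."""

    def dcg(relevances):
        return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

    scores = []
    for ranking in rankings:
        ideal = dcg(sorted(ranking, reverse=True))  # best possible ordering
        scores.append(dcg(ranking) / ideal if ideal > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Each inner list is one query's ranked results: 1 = relevant, 0 = not relevant.
    example = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(top1(example), mrr(example), ndcg(example))
```

All three scores lie in [0, 1], with higher values indicating that relevant code appears earlier in the ranked results.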
