Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (3): 456-467. DOI: 10.3778/j.issn.1673-9418.2005048

• Science Researches •

Method of Code Features Automated Extraction

SHI Zhicheng, ZHOU Yu   

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  2. Key Laboratory for Safety-Critical Software Development and Verification, Ministry of Industry and Information Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Online: 2021-03-01 Published: 2021-03-05





Applying neural networks to software engineering has greatly reduced the burden of extracting code features by hand. Existing code feature extraction models typically either treat code as natural language or rely heavily on expert domain knowledge. Treating code as natural language is overly simplistic and easily loses information, whereas models built on expert-designed heuristic rules tend to be overly complex and lack extensibility and generalization. To address these problems, this paper proposes a model based on a convolutional neural network and a recurrent neural network that extracts code features from the abstract syntax tree (AST). To mitigate the vanishing gradient problem caused by the large size of an AST, the AST is split into a sequence of small ASTs, which are then fed into the model. The convolutional neural network extracts structural information, and the recurrent neural network extracts sequential information. The whole procedure requires no expert domain knowledge to guide training: the model learns automatically how to extract features from code that has been labeled with classification categories. The trained encoder is evaluated on a similar code search task, where it achieves a Top1 of 0.560, an NDCG of 0.679, and an MRR of 0.638. Compared with recent state-of-the-art deep learning models for feature extraction and with common similar code detection tools, the proposed model shows significant advantages.
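The idea of splitting one large AST into a sequence of small ASTs can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the split happens at statement boundaries, which the abstract does not specify, and it uses Python's standard `ast` module only as a convenient stand-in for whatever parser and language the paper targets. The function name `split_into_statement_trees` is hypothetical.

```python
import ast
from typing import List


def split_into_statement_trees(source: str) -> List[List[str]]:
    """Split the AST of `source` into a sequence of small subtrees.

    ASSUMPTION: one subtree per statement. Each subtree is returned as the
    preorder list of its node-type names. A nested statement is cut out of
    its parent and becomes its own entry, so the subtrees do not overlap.
    """
    subtrees: List[List[str]] = []

    def collect(node: ast.AST, bucket: List[str]) -> None:
        # Record this node, then descend into its children.
        bucket.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.stmt):
                start(child)            # nested statement: new subtree
            else:
                collect(child, bucket)  # expression part: same subtree

    def start(stmt: ast.stmt) -> None:
        bucket: List[str] = []
        subtrees.append(bucket)         # parent is listed before its body
        collect(stmt, bucket)

    for stmt in ast.parse(source).body:
        start(stmt)
    return subtrees


if __name__ == "__main__":
    example = "total = 0\nfor x in range(3):\n    total += x\n"
    for tokens in split_into_statement_trees(example):
        print(tokens)
```

In a pipeline like the one the abstract describes, each small subtree would be encoded for its structure (the convolutional part), and the resulting sequence of encodings would then be passed to a recurrent network. That wiring is omitted here because the abstract gives no architectural details.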

Key words: code feature extraction, code classification, program comprehension, similar code search


