Method of Code Features Automated Extraction

doi:10.3778/j.issn.1673-9418.2005048

Abstract

Applying neural networks to software engineering has greatly eased the burden of extracting code features by hand. Existing approaches usually either treat code as natural language or rely on expert domain knowledge. Treating code as natural language is too simplistic and easily loses information, while models built on expert-designed heuristic rules tend to be overly complex and lack extensibility and generality. To address these problems, this paper proposes an automatic code feature extraction model based on convolutional and recurrent neural networks that extracts features from the abstract syntax tree (AST) of the code. To alleviate the vanishing gradient problem caused by very large ASTs, the AST is split into a sequence of smaller ASTs, which is then fed to the model. The model uses a convolutional network to extract structural information from the code and a bidirectional recurrent neural network to extract sequential information. The whole procedure requires no expert domain knowledge to guide training: given only code labeled with its category, the model learns automatically how to extract code features. The trained classification encoder is evaluated on a similar code search task, where it reaches Top1, NDCG, and MRR values of 0.560, 0.679, and 0.638, respectively, a significant advantage over recent state-of-the-art deep learning models for code feature extraction and over code similarity detection tools commonly used in industry.

Key words: code feature extraction, code classification, program comprehension, similar code search
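
The idea of cutting one large AST into a sequence of small ASTs can be illustrated with a minimal sketch. The abstract does not state the granularity of the split or the parser used, so the following assumes statement-level splitting and uses Python's standard `ast` module as a stand-in; the names `preorder`, `header_types`, and `split_ast` are hypothetical and are not taken from the paper.

```python
import ast

# Fields that hold nested statement lists. Treating these as split points
# is an assumption made for illustration, not the paper's exact rule.
BODY_FIELDS = ("body", "orelse", "finalbody", "handlers")


def preorder(node):
    """Yield the node type names of a subtree in preorder."""
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)


def header_types(node):
    """Node types of a compound statement, excluding its nested statements."""
    names = [type(node).__name__]
    for field, value in ast.iter_fields(node):
        if field in BODY_FIELDS:
            continue
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, ast.AST):
                names.extend(preorder(child))
    return names


def split_ast(source):
    """Split the AST of `source` into a sequence of small statement subtrees."""
    sequence = []

    def visit(stmt):
        nested = [f for f in BODY_FIELDS
                  if isinstance(getattr(stmt, f, None), list)]
        if not nested:
            # Simple statement: keep its whole subtree as one element.
            sequence.append(list(preorder(stmt)))
            return
        # Compound statement: emit its header, then split its nested statements.
        sequence.append(header_types(stmt))
        for field in nested:
            for child in getattr(stmt, field):
                visit(child)

    for stmt in ast.parse(source).body:
        visit(stmt)
    return sequence


SAMPLE = """
def f(n):
    total = 0
    for i in range(n):
        total += i
    return total
"""

sequence = split_ast(SAMPLE)
for subtree in sequence:
    print(subtree)
```

Each element of `sequence` is a small subtree that a tree-based encoder could process independently, and the order of the elements preserves the statement order that a sequence model would then consume. How the paper encodes each subtree and combines the results is not shown here.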

史志成, 周宇. 代码特征自动提取方法[J]. 计算机科学与探索, 2021, 15(3): 456-467.

SHI Zhicheng, ZHOU Yu. Method of Code Features Automated Extraction[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(3): 456-467.
