Overview of Deep Learning-Based Code Representation and Its Applications

Journal of Frontiers of Computer Science and Technology, 2022, Vol. 16, Issue (9): 2011-2029. DOI: 10.3778/j.issn.1673-9418.2110073

ZHANG Xiangping¹,², LIU Jianxun¹,²,⁺

1. Hunan Key Lab for Services Computing and Novel Software Technology, Hunan University of Science and Technology, Xiangtan, Hunan 411201, China
2. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201, China

Received: 2021-10-28 Revised: 2022-04-21 Online: 2022-09-01 Published: 2022-09-15
Corresponding author: ⁺ E-mail: ljx529@gmail.com
About the authors: ZHANG Xiangping, born in 1993 in Sanming, Fujian, Ph.D. candidate. His research interests include code representation and code clone detection.
LIU Jianxun, born in 1970 in Hengyang, Hunan, Ph.D., professor. His research interests include service computing and cloud computing.
Supported by:
National Natural Science Foundation of China (61872139)

Abstract

Analyzing and reasoning about programs plays an important role in software development, maintenance, and migration. How to efficiently obtain high-quality information from program code has become an active research topic. In recent years, many researchers have introduced deep learning-based representation techniques into code analysis tasks. Deep learning models can automatically extract the implicit features contained in source code, reducing the dependence on manually engineered features. This paper first introduces the background and basic concepts of code representation, and summarizes recent research on deep learning-based code representation from the perspective of static code information analysis. It then describes the application of code representation to three tasks: code clone detection, code search, and code completion. Finally, it analyzes the problems that remain in existing work on deep learning-based code representation and discusses possible future research directions.

Key words: code representation, representation learning, software engineering, code analysis, deep learning

CLC number: TP391

ZHANG Xiangping, LIU Jianxun. Overview of Deep Learning-Based Code Representation and Its Applications[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(9): 2011-2029.

Figures/Tables: 6

References (131)

[1]	LIU F, LI G, HU X. Program comprehension based on deep learning[J]. Journal of Computer Research and Development, 2019, 56(8): 1605-1620. (in Chinese)
[2]	HINDLE A, BARR E T, SU Z, et al. On the naturalness of software[C]// Proceedings of the 2012 34th International Conference on Software Engineering, Zurich, Jun 2-9, 2012. Washington: IEEE Computer Society, 2012: 837-847.
[3]	ROBBES R, LANZA M. How program history can improve code completion[C]// Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering, L'Aquila, Sep 15-19, 2008. Washington: IEEE Computer Society, 2008: 317-326.
[4]	PROKSCH S, LERCH J, MEZINI M. Intelligent code completion with Bayesian networks[J]. ACM Transactions on Software Engineering and Methodology, 2015, 25(1): 1-31.
[5]	BIELIK P, RAYCHEV V, VECHEV M T. PHOG: probabilistic model for code[C]// Proceedings of the 33rd International Conference on Machine Learning, New York, Jun 19-24, 2016: 2933-2942.
[6]	OMORI T, KUWABARA H, MARUYAMA K. A study on repetitiveness of code completion operations[C]// Proceedings of the 2012 28th IEEE International Conference on Software Maintenance, Trento, Sep 23-28, 2012. Washington: IEEE Computer Society, 2012: 584-587.
[7]	TU Z P, SU Z D, DEVANBU P T. On the localness of software[C]// Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, Hong Kong, China, Nov 16-22, 2014. New York: ACM, 2014: 269-280.
[8]	KARNALIM O. TF-IDF inspired detection for cross-language source code plagiarism and collusion[J]. Computer Science, 2020, 21: 113-134.
[9]	LE T H, CHEN H, BABAR M A. Deep learning for source code modeling and generation: models, applications, and challenges[J]. ACM Computing Surveys, 2020, 53(3): 1-38.
[10]	ZHANG J, WANG X, ZHANG H Y. A novel neural source code representation based on abstract syntax tree[C]// Proceedings of the 41st International Conference on Software Engineering, Montreal, May 25-31, 2019. Piscataway: IEEE, 2019: 783-794.
[11]	LIU B B, DONG W, WANG J. Survey on intelligent search and construction methods of program[J]. Journal of Software, 2018, 29(8): 2180-2197. (in Chinese)
[12]	WHITE M, TUFANO M, VENDOME C. Deep learning code fragments for code clone detection[C]// Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, Singapore, Sep 3-7, 2016. New York: ACM, 2016: 87-98.
[13]	WHITE M, VENDOME C. Toward deep learning software repositories[C]// Proceedings of the 12th IEEE/ACM Working Conference on Mining Software Repositories, Florence, May 16-17, 2015. Washington: IEEE Computer Society, 2015: 334-345.
[14]	WHITE M, TUFANO M, MARTINEZ M, et al. Sorting and transforming program repair ingredients via deep learning code similarities[C]// Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, Hangzhou, Feb 24-27, 2019. Piscataway: IEEE, 2019: 479-490.
[15]	WANG P P, SVAJLENKO J, WU Y Z, et al. CCAligner: a token based large-gap clone detector[C]// Proceedings of the 40th International Conference on Software Engineering, Gothenburg, May 27-Jun 3, 2018. New York: ACM, 2018: 1066-1077.
[16]	GU X D, ZHANG H Y, KIM S H. Deep code search[C]// Proceedings of the 40th International Conference on Software Engineering, Gothenburg, May 27-Jun 3, 2018. New York: ACM, 2018: 933-944.
[17]	ALON U, ZILBERSTEIN M, LEVY O. A general path-based representation for predicting program properties[C]// Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, Philadelphia, Jun 18-22, 2018. New York: ACM, 2018: 404-419.
[18]	ALON U, ZILBERSTEIN M, LEVY O, et al. Code2vec: learning distributed representations of code[C]// Proceedings of the 2019 ACM on Programming Languages, Cascais, Jan 13-19, 2019. New York: ACM, 2019: 1-29.
[19]	MOU L L, LI G, ZHANG L. Convolutional neural network over tree structures for programming language processing[C]// Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, Feb 12-17, 2016. Menlo Park: AAAI, 2016: 1287-1293.
[20]	BÜCH L, ANDRZEJAK A. Learning-based recursive aggregation of abstract syntax trees for code clone detection[C]// Proceedings of the 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering, Hangzhou, Feb 24-27, 2019. Piscataway: IEEE, 2019: 95-104.
[21]	SAHLGREN M. The word-space model: using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces[D]. Stockholm: Institutionen för Lingvistik, 2006.
[22]	DUMAIS S T. Latent semantic analysis[J]. Annual Review of Information Science and Technology, 2004, 38(1): 188-230.
[23]	BLEI D M, NG A Y, JORDAN M I. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3(1): 993-1022.
[24]	ŘEHŮŘEK R, SOJKA P. Software framework for topic modelling with large corpora[C]// Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Malta, May 22, 2010. Valletta: University of Malta, 2010: 45-50.
[25]	LE Q V, MIKOLOV T. Distributed representations of sentences and documents[C]// Proceedings of the 31st International Conference on Machine Learning, Beijing, Jun 21-26, 2014: 1188-1196.
[26]	JIAN S L. Research on the representation learning of complex heterogeneous data[D]. Changsha: National University of Defense Technology, 2019. (in Chinese)
[27]	MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. arXiv:1301.3781, 2013.
[28]	KAUR A, NAYYAR R. A comparative study of static code analysis tools for vulnerability detection in C/C++ and JAVA source code[J]. Procedia Computer Science, 2020, 171: 2023-2029.
[29]	HARER J, KIM L, RUSSELL R, et al. Automated software vulnerability detection with machine learning[J]. arXiv:1803.04497, 2018.
[30]	CHEN Z M, MONPERRUS M. The remarkable role of similarity in redundancy-based program repair[J]. arXiv:1811.05703, 2018.
[31]	HENKEL J, LAHIRI S, LIBLIT B, et al. Code vectors: understanding programs through embedded abstracted symbolic traces[C]// Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Lake Buena Vista, Nov 4-9, 2018. New York: ACM, 2018: 163-174.
[32]	NGUYEN T D, NGUYEN A T, PHAN H D, et al. Exploring API embedding for API usages and applications[C]// Proceedings of the 39th International Conference on Software Engineering, Buenos Aires, May 20-28, 2017. Piscataway: IEEE, 2017: 438-449.
[33]	PRADEL M, SEN K. DeepBugs: a learning approach to name-based bug detection[J]. Proceedings of the ACM on Programming Languages, 2018, 2: 1-25.
[34]	IYER S, KONSTAS I, CHEUNG A. Summarizing source code using a neural attention model[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Aug 7-12, 2016. Stroudsburg: ACL, 2016: 2073-2083.
[35]	ALLAMANIS M, PENG H, SUTTON C. A convolutional attention network for extreme summarization of source code[C]// Proceedings of the 33rd International Conference on Machine Learning, New York, Jun 19-24, 2016: 2091-2100.
[36]	LI J, WANG Y, LYU M R, et al. Code completion with neural attention and pointer networks[J]. arXiv:1711.09573, 2017.
[37]	BHOOPCHAND A, ROCKTÄSCHEL T, BARR E. Learning Python code suggestion with a sparse pointer network[J]. arXiv:1611.08307, 2016.
[38]	SHUAI J, XU L, LIU C, et al. Improving code search with co-attentive representation learning[C]// Proceedings of the 28th International Conference on Program Comprehension, Seoul, Jul 13-15, 2020. New York: ACM, 2020: 196-207.
[39]	GU X D, ZHANG H Y, ZHANG D M, et al. Deep API learning[C]// Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Seattle, Nov 13-18, 2016. New York: ACM, 2016: 631-642.
[40]	LU X F, JIANG F S, ZHOU X, et al. ASSCA: API sequence and statistics features combined architecture for malware detection[J]. Computer Networks, 2019, 157: 99-111.
[41]	SAIFULLAH C M. Learning APIs through mining code snippet examples[D]. Saskatoon: University of Saskatchewan, 2020.
[42]	HU X, LI G, XIA X. Summarizing source code with transferred API knowledge[C]// Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Jul 13-19, 2018: 2269-2275.
[43]	SVAJLENKO J, ISLAM J F, KEIVANLOO I, et al. Towards a big data curated benchmark of inter-project code clones[C]// Proceedings of the 30th IEEE International Conference on Software Maintenance and Evolution, Victoria, Sep 29-Oct 3, 2014. Washington: IEEE Computer Society, 2014: 476-480.
[44]	WEI H H, LI M. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Aug 19-25, 2017: 3034-3040.
[45]	CHEN L, YE W, ZHANG S K. Capturing source code semantics via tree-based convolution over API-enhanced AST[C]// Proceedings of the 16th ACM International Conference on Computing Frontiers, Alghero, Apr 30-May 2, 2019. New York: ACM, 2019: 174-182.
[46]	WANG W H, LI G, MA B, et al. Detecting code clones with graph neural network and flow-augmented abstract syntax tree[C]// Proceedings of the 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, London, Feb 18-21, 2020. Piscataway: IEEE, 2020: 261-271.
[47]	HU X, LI G, XIA X. Deep code comment generation[C]// Proceedings of the 26th Conference on Program Comprehension, Gothenburg, May 27-28, 2018. New York: ACM, 2018: 200-210.
[48]	ALON U, BRODY S, LEVY O, et al. Code2seq: generating sequences from structured representations of code[J]. arXiv:1808.01400, 2018.
[49]	ALON U, SADAKA R, LEVY O, et al. Structural language models of code[C]// Proceedings of the 2020 International Conference on Machine Learning. New York: ACM, 2020: 245-256.
[50]	ALLAMANIS M, BROCKSCHMIDT M, KHADEMI M. Learning to represent programs with graphs[J]. arXiv:1711.00740, 2017.
[51]	LU M M, TAN D W, XIONG N X, et al. Program classification using gated graph attention neural network for online programming service[J]. arXiv:1903.03804, 2019.
[52]	BROCKSCHMIDT M, ALLAMANIS M, GAUNT A L. Generative code modeling with graphs[J]. arXiv:1805.08490, 2018.
[53]	BEN-NUN T, JAKOBOVITS A S, HOEFLER T. Neural code comprehension: a learnable representation of code semantics[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, Dec 3-8, 2018: 3589-3601.
[54]	LI Z M, LU S, MYAGMAR S, et al. CP-Miner: finding copy-paste and related bugs in large-scale software code[J]. IEEE Transactions on Software Engineering, 2006, 32(3): 176-192.
[55]	CHEN W K, LI B G, GUPTA R. Code compaction of matching single-entry multiple-exit regions[C]// LNCS 2694: Proceedings of the 10th International Symposium Static Analysis. Berlin, Heidelberg: Springer, 2003: 401-417.
[56]	KIM M, SAZAWAL V, NOTKIN D, et al. An empirical study of code clone genealogies[C]// Proceedings of the 10th European Software Engineering Conference Held Jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Lisbon, Sep 5-9, 2005. New York: ACM, 2005: 187-196.
[57]	PATENAUDE J, MERLO E, DAGENAIS M, et al. Extending software quality assessment techniques to Java systems[C]// Proceedings of the 7th International Workshop on Program Comprehension, Pittsburgh, May 5-7, 1999. Washington: IEEE Computer Society, 1999: 49-56.
[58]	SHENEAMER A, KALITA J. A survey of software clone detection techniques[J]. International Journal of Computer Applications, 2016, 137(10): 1-21.
[59]	BAKER B. On finding duplication and near-duplication in large software systems[C]// Proceedings of the 2nd Working Conference on Reverse Engineering, Toronto, Jul 14-16, 1995. Piscataway: IEEE, 1995: 86-95.
[60]	ROY C K, CORDY J R. NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization[C]// Proceedings of the 16th IEEE International Conference on Program Comprehension, Amsterdam, Jun 10-13, 2008. Washington: IEEE Computer Society, 2008: 172-181.
[61]	MONDAL M, RAHMAN M S, ROY C K, et al. Is cloned code really stable?[J]. Empirical Software Engineering, 2018, 23(2): 693-770.
[62]	JÜRGENS E, DEISSENBOECK F, HUMMEL B, et al. Do code clones matter?[C]// Proceedings of the 31st International Conference on Software Engineering, Vancouver, May 16-24, 2009. Piscataway: IEEE, 2009: 485-495.
[63]	MONDAL M, ROY C, SCHNEIDER K. Dispersion of changes in cloned and non-cloned code[C]// Proceedings of the 6th International Workshop on Software Clones, Zurich, Jun 4, 2012. Washington: IEEE Computer Society, 2012: 29-35.
[64]	LOZANO A, WERMELINGER M. Tracking clones' imprint[C]// Proceedings of the 4th ICSE International Workshop on Software Clones, Cape Town. New York: ACM, 2010: 65-72.
[65]	CHEN Q Y, LI S P, YAN M, et al. Code clone detection: a literature review[J]. Journal of Software, 2019, 30(4): 962-980. (in Chinese)
[66]	BELLON S, KOSCHKE R, ANTONIOL G, et al. Comparison and evaluation of clone detection tools[J]. IEEE Transactions on Software Engineering, 2007, 33(9): 577-591.
[67]	KAMIYA T, KUSUMOTO S, INOUE K. CCFinder: a multilinguistic token-based code clone detection system for large scale source code[J]. IEEE Transactions on Software Engineering, 2002, 28(7): 654-670.
[68]	DUCASSE S, RIEGER M, DEMEYER S. A language independent approach for detecting duplicated code[C]// Proceedings of the 1999 International Conference on Software Maintenance, Oxford, Aug 30-Sep 3, 1999. Washington: IEEE Computer Society, 1999: 109-118.
[69]	LEE S, JEONG I. SDD: high performance code clone detection system for large scale source code[C]// Proceedings of the Companion to the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, San Diego, Oct 16-20, 2005. New York: ACM, 2005: 140-141.
[70]	MURAKAMI H, HOTTA K, HIGO Y, et al. Gapped code clone detection with lightweight source code analysis[C]// Proceedings of the IEEE 21st International Conference on Program Comprehension, San Francisco, May 20-21, 2013. Washington: IEEE Computer Society, 2013: 93-102.
[71]	DANG Y N, ZHANG D M, GE S, et al. XIAO: tuning code clones at hands of engineers in practice[C]// Proceedings of the 28th Annual Computer Security Applications Conference, Orlando, Dec 3-7, 2012. New York: ACM, 2012: 369-378.
[72]	ALOMARI H, MATTHEW S. Clone detection through srcClone: a program slicing based approach[J]. Journal of Systems and Software, 2022, 184: 111115.
[73]	LI L, FENG H, ZHUANG W. CCLearner: a deep learning-based clone detection approach[C]// Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution, Shanghai, Sep 17-22, 2017. Washington: IEEE Computer Society, 2017: 249-260.
[74]	ZHAO G, HUANG J. DeepSim: deep learning code functional similarity[C]// Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Lake Buena Vista, Nov 4-9, 2018. New York: ACM, 2018: 141-151.
[75]	GAO Y, WANG Z, LIU S. TECCD: a tree embedding approach for code clone detection[C]// Proceedings of the 2019 IEEE International Conference on Software Maintenance and Evolution, Cleveland, Sep 29-Oct 4, 2019. Piscataway: IEEE, 2019: 145-156.
[76]	HUA W, SUI Y, WAN Y. FCCA: hybrid code representation for functional clone detection using attention networks[J]. IEEE Transactions on Reliability, 2020, 70(1): 304-318.
[77]	FANG C, LIU Z, SHI Y, et al. Functional code clone detection with syntax and semantics fusion learning[C]// Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. New York: ACM, 2020: 516-527.
[78]	MENG Y, LIU L. A deep learning approach for a source code detection model using self-attention[J]. Complexity, 2020, 9: 1-15.
[79]	GUO C, YANG H, HUANG D. Review sharing via deep semi-supervised code clone detection[J]. IEEE Access, 2020, 8: 24948-24965.
[80]	YE F, ZHOU S, VENKAT A. MISIM: an end-to-end neural code similarity system[J]. arXiv:2006.05265, 2020.
[81]	ZHANG A, LIU K, FANG L, et al. Learn to align: a code alignment network for code clone detection[C]// Proceedings of the 28th Asia-Pacific Software Engineering Conference, Taipei, China, Dec 6-9, 2021. Piscataway: IEEE, 2021: 1-11.
[82]	LIANG H, AI L. AST-path based compare-aggregate network for code clone detection[C]// Proceedings of the 2021 International Joint Conference on Neural Networks, Shenzhen, Jul 18-22, 2021. Piscataway: IEEE, 2021: 1-8.
[83]	SINGER J, LETHBRIDGE T C, VINSON N G, et al. An examination of software engineering work practices[C]// Proceedings of the 1997 Conference of the Centre for Advanced Studies on Collaborative Research, Toronto, Nov 10-13, 1997: 21.
[84]	ZHONG H, XIE T, ZHANG L, et al. MAPO: mining and recommending API usage patterns[C]// LNCS 5653: Proceedings of the 23rd European Conference on Object-Oriented Programming, Genoa, Jul 6-10, 2009. Berlin, Heidelberg: Springer, 2009: 318-343.
[85]	ZHANG F Y, PENG X, CHEN C. Research on code analysis based on deep learning[J]. Computer Applications and Software, 2018, 35(6): 9-17. (in Chinese)
[86]	SUBRAMANIAN S, INOZEMTSEVA L, HOLMES R. Live API documentation[C]// Proceedings of the 36th International Conference on Software Engineering, Hyderabad, May 31-Jun 7, 2014. New York: ACM, 2014: 643-652.
[87]	KIM K, KIM D, BISSYANDÉ T F, et al. FaCoY: a code-to-code search engine[C]// Proceedings of the 40th International Conference on Software Engineering, Gothenburg, May 27-Jun 3, 2018. New York: ACM, 2018: 946-957.
[88]	DEERWESTER S C, DUMAIS S T, LANDAUER T K, et al. Indexing by latent semantic analysis[J]. Journal of the American Society for Information Science, 1990, 41(6): 391-407.
[89]	BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2003, 3(2): 1137-1155.
[90]	EGOZI O, MARKOVITCH S, GABRILOVICH E. Concept-based information retrieval using explicit semantic analysis[J]. ACM Transactions on Information Systems, 2011, 29(2): 1-34.
[91]	SCHUHMACHER M, PONZETTO S P. Knowledge-based graph document modeling[C]// Proceedings of the 7th ACM International Conference on Web Search and Data Mining, New York, Feb 24-28, 2014. New York: ACM, 2014: 543-552.
[92]	LIU X T, FANG H. Latent entity space: a novel retrieval approach for entity-bearing queries[J]. Information Retrieval Journal, 2015, 18(6): 473-503.
[93]	XIONG C Y, CALLAN J. EsdRank: connecting query and documents through external semi-structured data[C]// Proceedings of the 24th ACM International Conference on Information and Knowledge Management, Melbourne, Oct 19-23, 2015. New York: ACM, 2015: 951-960.
[94]	RAVIV H, KURLAND O, CARMEL D. Document retrieval using entity-based language models[C]// Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Jul 17-21, 2016. New York: ACM, 2016: 65-74.
[95]	NI Y, XU Q K, CAO F, et al. Semantic documents relatedness using concept graph representation[C]// Proceedings of the 9th ACM International Conference on Web Search and Data Mining, San Francisco, Feb 22-25, 2016. New York: ACM, 2016: 635-644.
[96]	GABRILOVICH E, MARKOVITCH S. Computing semantic relatedness using Wikipedia-based explicit semantic analysis[C]// Proceedings of the 2007 International Joint Conference on Artificial Intelligence, Hyderabad, Jan 6-12, 2007. San Mateo: Morgan Kaufmann, 2007: 1606-1611.
[97]	SACHDEV S, LI H Y, LUAN S F, et al. Retrieval on source code: a neural code search[C]// Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, Philadelphia, Jun 18-22, 2018. New York: ACM, 2018: 31-41.
[98]	LV F, ZHANG H Y, LOU J G, et al. CodeHow: effective code search based on API understanding and extended Boolean model[C]// Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, Lincoln, Nov 9-13, 2015. Washington: IEEE Computer Society, 2015: 260-270.
[99]	FANG S, TAN Y, ZHANG T, et al. Self-attention networks for code search[J]. Information and Software Technology, 2021, 134: 106542-106553.
[100]	GU J, CHEN Z, MONPERRUS M. Multimodal representation for neural code search[C]// Proceedings of the 2021 International Conference on Software Maintenance and Evolution, Luxembourg, Sep 27-Oct 1, 2021. Piscataway: IEEE, 2021: 483-494.
[101]	MENG Y. An intelligent code search approach using hybrid encoders[J]. Wireless Communications and Mobile Computing, 2021: 9990988.
[102]	XU L, YANG H, LIU C, et al. Two-stage attention-based model for code search with textual and structural features[C]// Proceedings of the 28th IEEE International Conference on Software Analysis, Evolution and Reengineering, Honolulu, Mar 9-12, 2021. Piscataway: IEEE, 2021: 342-353.
[103]	ZOU Y Z, LING C Y, LIN Z Q, et al. Graph embedding based code search in software project[C]// Proceedings of the 10th Asia-Pacific Symposium on Internetware, Beijing, Sep 16, 2018. New York: ACM, 2018: 1-10.
[104]	GORIN R E. SPELL: a spelling checking and correction program[J]. Online Documentation for the DEC-10 Computer, 1971: 147-160.
[105]	BRUCH M, MONPERRUS M, MEZINI M. Learning from examples to improve code completion systems[C]// Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT International Symposium on Foundations of Software Engineering, Amsterdam, Aug 24-28, 2009. New York: ACM, 2009: 213-222.
[106]	HOU D Q, PLETCHER D M. An evaluation of the strategies of sorting, filtering, and grouping API methods for code completion[C]// Proceedings of the IEEE 27th International Conference on Software Maintenance, Williamsburg, Sep 25-30, 2011. Washington: IEEE Computer Society, 2011: 233-242.
[107]	LEE Y Y, HARWELL S, KHURSHID S, et al. Temporal code completion and navigation[C]// Proceedings of the 35th International Conference on Software Engineering, San Francisco, May 18-26, 2013. Washington: IEEE Computer Society, 2013: 1181-1184.
[108]	NGUYEN A T, NGUYEN H A, NGUYEN T T, et al. GraPacc: a graph-based pattern-oriented, context-sensitive code completion tool[C]// Proceedings of the 34th International Conference on Software Engineering, Zurich, Jun 2-9, 2012. Washington: IEEE Computer Society, 2012: 1407-1410.
[109]	JIN X H, SERVANT F. The hidden cost of code completion: understanding the impact of the recommendation-list length on its efficiency[C]// Proceedings of the 15th International Conference on Mining Software Repositories, Gothenburg, May 28-29, 2018. New York: ACM, 2018: 70-73.
[110]	ZHONG H, WANG X Y. Boosting complete-code tool for partial program[C]// Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, Urbana, Oct 30-Nov 3, 2017. Washington: IEEE Computer Society, 2017: 671-681.
[111]	NGUYEN T T, NGUYEN A T, NGUYEN H A, et al. A statistical semantic language model for source code[C]// Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, Saint Petersburg, Aug 18-26, 2013. New York: ACM, 2013: 532-542.
[112]	DE SOUZA AMORIM L E, ERDWEG S, WACHSMUTH G, et al. Principled syntactic code completion using placeholders[C]// Proceedings of the 2016 ACM SIGPLAN International Conference on Software Language Engineering, Amsterdam, Oct 31-Nov 1, 2016. New York: ACM, 2016: 163-175.
[113]	HOU D Q, PLETCHER D M. Towards a better code completion system by API grouping, filtering, and popularity-based ranking[C]// Proceedings of the 2nd International Workshop on Recommendation Systems for Software Engineering, Cape Town, May 4, 2010. New York: ACM, 2010: 26-30.
[114]	JACOBELLIS J, MENG N, KIM M. Cookbook: in situ code completion using edit recipes learned from examples[C]// Companion Proceedings of the 36th International Conference on Software Engineering, Hyderabad, May 31-Jun 7, 2014. New York: ACM, 2014: 584-587.
[115]	NGUYEN T T, PHAM H V, VU P M, et al. Recommending API usages for mobile apps with hidden Markov model[C]// Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, Lincoln, Nov 9-13, 2015. Washington: IEEE Computer Society, 2015: 795-800.
[116]	GVERO T, KUNCAK V, KURAJ I, et al. Complete completion using types and weights[J]. ACM SIGPLAN Notices, 2013, 48(6): 27-38.
[117]	FERNANDES P, ALLAMANIS M, BROCKSCHMIDT M. Structured neural summarization[J]. arXiv:1811.01824, 2018.
[118]	KARAMPATSIS R, BABII H, ROBBES R, et al. Big code != big vocabulary: open-vocabulary models for source code[C]// Proceedings of the 42nd International Conference on Software Engineering, Seoul, Jun 27-Jul 19, 2020. New York: ACM, 2020: 1073-1085.
[119]	YANG B, ZHANG N, LI S P, et al. Survey of intelligent code completion[J]. Journal of Software, 2020, 31(5): 1435-1453. (in Chinese)
[120]	HAN S, WALLACE D R, MILLER R C. Code completion of multiple keywords from abbreviated input[J]. Automated Software Engineering, 2011, 18(3/4): 363-398.
[121]	HAN S, WALLACE D R, MILLER R C. Code completion from abbreviated input[C]// Proceedings of the 24th IEEE/ACM International Conference on Automated Software Engineering, Auckland, Nov 16-20, 2009. Washington: IEEE Computer Society, 2009: 332-343.
[122]	RAYCHEV V, BIELIK P, VECHEV M. Probabilistic model for code with decision trees[J]. ACM SIGPLAN Notices, 2016, 51(10): 731-747.
[123]	HELLENDOORN V J, DEVANBU P. Are deep neural networks the best choice for modeling source code?[C]// Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, Paderborn, Sep 4-8, 2017. New York: ACM, 2017: 763-773.
[124]	BAZZI I. Modelling out-of-vocabulary words for robust speech recognition[D]. Massachusetts Institute of Technology, 2002.
[125]	LUONG M T, SOCHER R, MANNING C D. Better word representations with recursive neural networks for morphology[C]// Proceedings of the 17th Conference on Computational Natural Language Learning, Sofia, Aug 8-9, 2013. Stroudsburg: ACL, 2013: 104-113.
[126]	HARRIS Z. Distributional structure[J]. Word, 1954, 10(2/3): 146-162.
[127]	BABII H, JANES A, ROBBES R. Modeling vocabulary for big code machine learning[J]. arXiv:1904.01873, 2019.
[128]	DEVLIN J, CHANG M W, LEE K. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[129]	RADFORD A, NARASIMHAN K, SALIMANS T. Improving language understanding by generative pre-training[EB/OL]. [2021-07-06]. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
[130]	PETERS M, NEUMANN M, IYYER M. Deep contextualized word representations[J]. arXiv:1802.05365, 2018.
[131]	KANG H J, BISSYANDÉ T F, LO D. Assessing the generalizability of code2vec token embeddings[C]// Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering, San Diego, Nov 11-15, 2019. Piscataway: IEEE, 2019: 1-12.


Programming language	Tool name	Tool URL
Java	JavaParser	http://Javaparser.org/
Python	astor	https://github.com/berkerpeksag/astor
TypeScript	TypeScript AST Viewer	https://ts-ast-viewer.com/
JavaScript	javascript-astar	https://github.com/bgrins/Javascript-astar
C	pycparser	https://github.com/eliben/pycparser
C++	cppast	https://github.com/foonathan/cppast
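
The tools listed above turn source code into an abstract syntax tree (AST), which many of the surveyed representation models take as input. As a minimal sketch of that first step, the block below uses Python's standard-library `ast` module (an assumption made for self-containment; it is not one of the tools in the table) to parse a snippet and flatten the tree into a pre-order sequence of node type names. The names `SOURCE` and `preorder_node_types` are hypothetical and chosen only for illustration.

```python
import ast

# Hypothetical example input; any syntactically valid Python snippet works.
SOURCE = "def add(a, b):\n    return a + b\n"


def preorder_node_types(source: str) -> list[str]:
    """Parse `source` and list AST node type names, parent before children."""
    tree = ast.parse(source)
    names: list[str] = []

    def visit(node: ast.AST) -> None:
        # Record this node, then descend into its children in field order.
        names.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return names


if __name__ == "__main__":
    print(preorder_node_types(SOURCE))
```

A flattened node sequence like this is only one possible way to feed structure to a model; tree-based and graph-based models work on the tree itself rather than on a linearized form.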

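Model	Clone types detected	Neural network model	Languages	Year
CCLearner [73]	Type-1, 2, 3 (ST)	Deep neural network	Java	2017
CDLH [44]	Type-1, 2, 3, 4	Long short-term memory network	Java, C	2017
DeepSim [74]	Type-1, 2, 3, 4	Feedforward neural network	Java	2018
ASTNN [10]	Type-1, 2, 3, 4	Gated recurrent unit	Java, C	2019
TECCD [75]	Type-1, 2, 3	Graph neural network	Java	2019
FCCA [76]	Type-1, 2, 3, 4	Long short-term memory network, graph neural network	Java	2020
FCDetector [77]	Type-4	Deep neural network	C	2020
At-biLSTM [78]	Type-1, 2, 3, 4	Bidirectional long short-term memory network	Java, C	2020
Rsharer+ [79]	Type-1, 2, 3, 4	Convolutional neural network	Java	2020
MISIM [80]	Type-1, 2, 3, 4	Graph neural network	C, C++	2020
CodeAli [81]	Type-1, 2, 3, 4	Convolutional neural network	Java, C	2021
CACCD [82]	Type-1, 2, 3, 4	Bidirectional long short-term memory network	Java	2021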
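
The detection types in the table follow the usual clone taxonomy, which this excerpt does not spell out: Type-1 clones are identical up to whitespace and comments, Type-2 clones additionally differ in identifier names and literals, Type-3 clones have statements added, removed, or changed, and Type-4 clones are syntactically different but functionally similar. The sketch below illustrates only the first two types with plain token comparison, using Python's standard `tokenize` module. It is not an implementation of any model in the table; the names `token_sequence`, `clone_type`, and the sample snippets are hypothetical, and it does not check that renaming is consistent.

```python
import io
import keyword
import tokenize

# Tokens that carry no program content for this comparison.
_SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Layout tokens are reduced to their kind so indentation width is ignored.
_LAYOUT = {
    tokenize.NEWLINE: "NEWLINE",
    tokenize.INDENT: "INDENT",
    tokenize.DEDENT: "DEDENT",
}


def token_sequence(source: str, normalize: bool) -> list[str]:
    """Tokenize Python source; optionally abstract identifiers and literals."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type in _LAYOUT:
            out.append(_LAYOUT[tok.type])
        elif normalize and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")   # any identifier
        elif normalize and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")  # any literal
        else:
            out.append(tok.string)
    return out


def clone_type(a: str, b: str) -> str:
    """Classify a pair as Type-1, Type-2, or neither (token-level check only)."""
    if token_sequence(a, False) == token_sequence(b, False):
        return "Type-1"
    if token_sequence(a, True) == token_sequence(b, True):
        return "Type-2"
    return "not Type-1/2"


ORIGINAL = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
REFORMATTED = "def total(xs):\n  # running sum\n  s = 0\n  for x in xs:\n      s += x  # accumulate\n  return s\n"
RENAMED = "def acc(values):\n    r = 0\n    for v in values:\n        r += v\n    return r\n"
DIFFERENT = "def total(xs):\n    return sum(xs)\n"

if __name__ == "__main__":
    for other in (REFORMATTED, RENAMED, DIFFERENT):
        print(clone_type(ORIGINAL, other))
```

`ORIGINAL` and `DIFFERENT` compute the same sum, so they would count as a Type-4 pair, yet token comparison cannot relate them. That gap is what the learned representations in the table aim to close.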
