Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (4): 844-854. DOI: 10.3778/j.issn.1673-9418.2010087

• System Software and Software Engineering •

Code Search Combining Graph Embedding and Attention Mechanism

HUANG Siyuan, ZHAO Yuhai+, LIANG Yiming

  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
  • Received: 2020-10-28 Revised: 2021-01-18 Published: 2022-04-01 Online: 2021-02-04
  • Corresponding author: + E-mail: zhaoyuhai@mail.neu.edu.cn
  • About the authors: HUANG Siyuan, born in 1995, male, from Siping, Jilin, M.S. candidate. His research interests include deep learning and machine learning.
    ZHAO Yuhai, born in 1975, male, from Shenyang, Liaoning, Ph.D., professor, Ph.D. supervisor, senior member of CCF. His research interests include big data mining, machine learning, and social network data analysis.
    LIANG Yiming, born in 1996, male, from Shangqiu, Henan, M.S. candidate. His research interests include deep learning and machine learning.
  • Supported by:
    National Natural Science Foundation of China (61772124); National Key Research and Development Program of China (2018YFB1004402)

Abstract:

Source code retrieval takes a natural language query and searches a code base for relevant code snippets. Most existing code retrieval algorithms consider only the text sequence of a code snippet and ignore its structural information, so they cannot fully capture the semantic and syntactic information the snippet contains. To improve the understanding of programming languages, this paper proposes GraphCS, a code retrieval algorithm that combines an attention mechanism with graph embedding. In the feature extraction stage, an LSTM extracts the text feature vectors and Graph2Vec extracts the graph feature vectors. In the feature fusion stage, an attention mechanism assigns a weight to each feature, which improves the understanding of the program. Because source code and natural language are heterogeneous data, the code snippet features and the natural language features are mapped into the same vector space, and a ranking loss ensures that semantically similar points lie close together in that space. To verify the effectiveness of the algorithm, GraphCS is compared with CODEnn, the best existing algorithm. Experimental results show improvements in Precision@1/5/10, SuccessRate@1/5/10, and MRR.

Key words: source code retrieval, attention mechanism, graph embedding, natural language, semantic similarity, vector space
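
To make the fusion and training objective described in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the text feature (from the LSTM) and the graph feature (from Graph2Vec) have already been computed and have the same dimension, that attention scores are dot products with a hypothetical learned vector `context`, and that similarity in the shared space is measured by cosine similarity. The function names, the toy vectors, and the margin value of 0.05 are all illustrative assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, context):
    """Fuse same-sized feature vectors into one weighted sum.

    Each feature is scored by its dot product with `context`
    (a hypothetical stand-in for learned attention parameters);
    softmax turns the scores into weights that sum to 1.
    """
    weights = softmax([dot(f, context) for f in features])
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


def ranking_loss(code_vec, pos_query, neg_query, margin=0.05):
    """Hinge-style ranking loss on cosine similarity.

    The loss is zero once the matching query is more similar to the
    code vector than the non-matching query by at least `margin`
    (the margin value here is an arbitrary illustrative choice).
    """
    return max(0.0, margin - cosine(code_vec, pos_query) + cosine(code_vec, neg_query))


# Toy vectors standing in for the LSTM text feature and the Graph2Vec feature.
text_feat = [1.0, 0.0, 0.0]
graph_feat = [0.0, 1.0, 0.0]
context = [1.0, 1.0, 0.0]  # hypothetical attention parameters

code_vec, weights = attention_fuse([text_feat, graph_feat], context)

# A query aligned with the fused code vector, and an unrelated one.
matching_query = [1.0, 1.0, 0.0]
unrelated_query = [0.0, 0.0, 1.0]

loss_good = ranking_loss(code_vec, matching_query, unrelated_query)
loss_bad = ranking_loss(code_vec, unrelated_query, matching_query)
```

In this toy setup both features receive equal attention weight, the correctly ordered pair incurs no loss, and the reversed pair is penalized. In a trained model the attention parameters and both encoders would be learned jointly by minimizing this kind of loss over many code/query pairs.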

CLC Number: