Journal of Frontiers of Computer Science and Technology ›› 2017, Vol. 11 ›› Issue (2): 212-220. DOI: 10.3778/j.issn.1673-9418.1512105

• Academic Research •

Research on Automatic Summarization for Java Packages

LIU Yu1, SUN Xiaobing1,2+, LI Bin1,2

  1. School of Information Engineering, Yangzhou University, Yangzhou, Jiangsu 225127, China
  2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Online: 2017-02-01 Published: 2017-02-10

Abstract: Program comprehension is the process of recovering the functionality and knowledge abstracted in a software program, and it is important to software maintenance. Studies estimate that maintenance consumes about 50% to 80% of the software budget, and that about 47% to 62% of maintenance time is spent on understanding the software system. This paper proposes an approach to summarizing the packages of a Java software system. Working at the semantic level, it employs latent semantic indexing, a typical information retrieval technique, and hierarchical clustering, a data mining technique, to extract semantic information from source code and to group source files that share similar vocabulary. Topics are then retrieved from these clusters to characterize the packages, and linguistic information is recovered from the generated vocabulary. Finally, MiniPar, a parser for the English language, is used to help generate the package summaries. The experimental results show that the proposed approach can improve the efficiency of program comprehension.

Key words: program comprehension, latent semantic indexing, clustering, topic, summarization
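
The grouping and topic-extraction steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the latent semantic indexing projection (an SVD of the term-document matrix) and the MiniPar sentence-generation step are both omitted, and raw term counts with cosine similarity and single-link agglomerative clustering are used instead. All names here (`tokenize`, `cluster_files`, `top_terms`, `SAMPLE_FILES`), the stop-word list, and the 0.3 threshold are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop list: Java keywords and common types that carry no topic.
STOP_WORDS = {"class", "void", "return", "string", "public", "private", "static", "new", "int"}

# Hypothetical toy input: file name -> source text.
SAMPLE_FILES = {
    "XmlParser.java": "class XmlParser { Document parseXml(String xmlText) { return parseDocument(xmlText); } }",
    "XmlWriter.java": "class XmlWriter { void writeXml(Document xmlDocument) { } }",
    "HttpClient.java": "class HttpClient { Response sendRequest(Request httpRequest) { } }",
    "HttpServer.java": "class HttpServer { void handleRequest(Request httpRequest, Response httpResponse) { } }",
}


def tokenize(source: str) -> list[str]:
    """Split source text into lowercase words, breaking identifiers at camelCase boundaries."""
    words = re.findall(r"[A-Za-z][a-z]*", source)
    return [w.lower() for w in words if len(w) > 2 and w.lower() not in STOP_WORDS]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_files(files: dict[str, str], threshold: float = 0.3) -> list[set[str]]:
    """Single-link agglomerative clustering; stops when no pair of clusters reaches the threshold."""
    vectors = {name: Counter(tokenize(text)) for name, text in files.items()}
    clusters = [{name} for name in files]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(cosine(vectors[x], vectors[y]) for x in clusters[i] for y in clusters[j])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break
        i, j = pair
        merged = clusters.pop(j)  # j > i, so index i is unaffected by the pop
        clusters[i] |= merged
    return clusters


def top_terms(members: set[str], files: dict[str, str], k: int = 3) -> list[str]:
    """Return the k most frequent terms across a cluster as its candidate topic words."""
    counts = Counter()
    for name in sorted(members):
        counts.update(tokenize(files[name]))
    return [term for term, _ in counts.most_common(k)]


if __name__ == "__main__":
    for members in cluster_files(SAMPLE_FILES):
        print(sorted(members), top_terms(members, SAMPLE_FILES))
```

On the toy input, the two XML files and the two HTTP files end up in separate clusters because they share no vocabulary once stop words are removed. A fuller pipeline would project the term vectors into a low-rank space before measuring similarity, which is what lets latent semantic indexing relate files that use different but co-occurring words.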