Journal of Frontiers of Computer Science and Technology ›› 2020, Vol. 14 ›› Issue (3): 389-400. DOI: 10.3778/j.issn.1673-9418.1901059

• Academic Research •

Research on Data Quality of Open Source Code Data

BAO Panpan (包盼盼), TAO Chuanqi (陶传奇), HUANG Zhiqiu (黄志球)

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  2. Ministry Key Laboratory for Safety-Critical Software Development and Verification, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  4. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210016, China
  • Online: 2020-03-01    Published: 2020-03-13

Abstract:

Code generation and bug prediction based on open source code big data are important research topics in current intelligent software development methods and techniques. However, existing research focuses mainly on intelligent algorithms for tasks such as recommendation and prediction, and seldom evaluates or analyzes the quality of the data it uses. Most of the data used in intelligent software development research comes from open source hosting platforms, and because developers differ in skill level, this data cannot be guaranteed to be of high quality. By the principle of "garbage in, garbage out", this affects the quality of the final results. The quality of source code data therefore has an important impact on related research, yet it has not received sufficient attention. To address this problem, this paper proposes a data quality evaluation approach for method blocks in open source code big data. It first studies how to define and evaluate the data quality problems of source code extracted from GitHub, and then evaluates the quality of open source code along different dimensions. The approach can help researchers construct higher-quality data sets and thereby improve the results of intelligent software development research such as code generation and bug prediction.

Key words: intelligent programming, open source big data, source code data, data quality