Research on Data Quality of Open Source Code Data

Journal of Frontiers of Computer Science and Technology, 2020, Vol. 14, Issue (3): 389-400. DOI: 10.3778/j.issn.1673-9418.1901059

BAO Panpan, TAO Chuanqi, HUANG Zhiqiu

1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
2. Ministry Key Laboratory for Safety-Critical Software Development and Verification, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
4. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210016, China

Online: 2020-03-01  Published: 2020-03-13

Abstract

Code generation and bug prediction based on open source code big data are important research topics in current intelligent software development methods and techniques. However, existing work focuses mainly on intelligent algorithms for recommendation, prediction, and similar tasks, and rarely evaluates or analyzes the quality of the data these studies use. Most of the data used in intelligent software development research comes from open source hosting platforms, and because developers vary in skill, this data cannot all be assumed to be of high quality. By the principle of "garbage in, garbage out", low-quality data degrades the quality of the final results. The quality of source code data therefore has an important impact on related research, yet it has not received sufficient attention. To address this problem, this paper proposes a data quality evaluation approach for method blocks in open source code big data. It first studies how to define and evaluate the data quality of source code extracted from GitHub, and then evaluates the quality of open source code along different dimensions. The approach can help researchers construct higher-quality data sets and thereby improve the results of intelligent software development research, such as code generation and bug prediction.

Key words: intelligent programming, open source big data, source code data, data quality

BAO Panpan, TAO Chuanqi, HUANG Zhiqiu. Research on Data Quality of Open Source Code Data[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(3): 389-400.
