Journal of Frontiers of Computer Science and Technology, 2020, Vol. 14, Issue (2): 215-235. DOI: 10.3778/j.issn.1673-9418.1810047

• Academic Research •

Approach to Software Defect Features Selection Using Extended Bayesian Information Criterion

TU Jiping, QIAN Ye, WANG Wei, FAN Daoyuan, ZHANG Hanyu   

  1. School of Software, Yunnan University, Kunming 650500, China
  2. Key Laboratory for Software Engineering of Yunnan Province, Kunming 650500, China
  3. School of Big Data (Information Engineering), Yunnan Agricultural University, Kunming 650201, China
  • Online: 2020-02-01 Published: 2020-02-16

Abstract:

Building a software defect prediction model on a large number of metrics can degrade its performance, because some of those metrics are irrelevant to the defect label. Feature selection addresses this by building the model on a lower-dimensional subset of the defect data, which compresses the feature dimension, improves prediction accuracy, reduces model complexity, and saves computing resources. Traditional feature ranking methods evaluate only the influence of each individual feature on the class label, so the resulting prediction models have low effectiveness; feature subset selection methods must search all feature subsets, which consumes computing resources, and they tend to select high-dimensional subsets. To address these problems, this paper proposes a feature selection method based on the extended Bayesian information criterion (EBIC-FS), which fits a linear regression to the data and selects the feature model with a small residual sum of squares and few feature dimensions. Experiments are conducted on the public datasets M&R and Promise. The results show that the method compresses the feature dimension effectively and that the resulting prediction model performs considerably better than 5 baseline methods.

Key words: software defect prediction, feature selection, extended Bayesian information criterion, best feature subset
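
To make the criterion in the abstract concrete, the following is a minimal sketch and not the authors' EBIC-FS implementation. It assumes the standard Gaussian linear-regression form of the extended Bayesian information criterion, n·log(RSS/n) + k·log(n) + 2γ·log C(p, k), where n is the number of samples, p the number of candidate features, k the size of the subset, and RSS the residual sum of squares. The abstract does not describe the search procedure, so a greedy forward search is assumed here; the names `ebic` and `forward_select` and the parameter `gamma` are hypothetical.

```python
import math
import numpy as np


def ebic(X, y, subset, gamma=1.0):
    """EBIC of an ordinary-least-squares fit of y on the columns in `subset`.

    Assumed form (Gaussian linear model):
        n * log(RSS / n) + k * log(n) + 2 * gamma * log(C(p, k))
    """
    n, p = X.shape
    k = len(subset)
    # Design matrix: intercept column plus the selected feature columns.
    A = np.column_stack([np.ones(n), X[:, list(subset)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ coef) ** 2))
    rss = max(rss, 1e-12)  # guard against log(0) on a perfect fit
    penalty = k * math.log(n) + 2.0 * gamma * math.log(math.comb(p, k))
    return n * math.log(rss / n) + penalty


def forward_select(X, y, gamma=1.0):
    """Greedy forward search; an assumption, not the paper's procedure.

    Repeatedly adds the single feature that lowers EBIC the most and
    stops as soon as no remaining feature lowers it further.
    """
    p = X.shape[1]
    selected = []
    best = ebic(X, y, selected, gamma)
    while len(selected) < p:
        scores = {
            j: ebic(X, y, selected + [j], gamma)
            for j in range(p)
            if j not in selected
        }
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no candidate improves the criterion
        selected.append(j_best)
        best = scores[j_best]
    return sorted(selected), best
```

Calling `forward_select(X, y)` on a metrics matrix `X` (one row per module, one column per metric) and a numeric defect label `y` returns the chosen column indices and their EBIC score. Setting `gamma=0` reduces the criterion to ordinary BIC; larger values penalize large subsets more heavily, which is how the criterion trades a small residual sum of squares against a low feature dimension.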