计算机科学与探索 ›› 2023, Vol. 17 ›› Issue (9): 2241-2251.DOI: 10.3778/j.issn.1673-9418.2207105

• 大数据技术 • 上一篇    下一篇

面向K-近邻学习模型的高效数据清洗框架

王婧怡,陈胤佳,袁野,陈辰,王国仁   

  1. 1. 北京理工大学 计算机学院,北京 100081
    2. 北京航空航天大学 计算机学院,北京 100191
  • 出版日期:2023-09-01 发布日期:2023-09-01

Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models

WANG Jingyi, CHEN Yinjia, YUAN Ye, CHEN Chen, WANG Guoren   

  1. 1. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
    2. School of Computer Science & Engineering, Beihang University, Beijing 100191, China
  • Online:2023-09-01 Published:2023-09-01

摘要: 现实世界中收集的数据集通常是含有缺失的,为了在不完备数据集上构建有效的机器学习模型,需要对数据集进行清洗。为了确保较好的清洗效果,通常需要人工参与,从而导致大量成本。确定不完备数据的清洗优先级将有助于减小清洗规模,节约人工成本。而计算不完备数据的清洗优先级应确定其对模型性能的贡献。夏普利值是目前流行的用来评估数据在机器学习模型中贡献的方法,因此可以借助夏普利值的概念计算不完备数据的清洗优先级。由于现有工作缺少对不完备数据夏普利值的研究,首先基于不完备数据集的指数级的所有可能世界定义了一种不完备数据夏普利值的表示方法;然后基于K-近邻分类模型的效用函数,提出了一种多项式时间内计算不完备数据在K-近邻分类模型中夏普利值的近似算法;最后提出了一种基于夏普利值的面向K-近邻分类模型的启发式数据清洗算法ShapClean。实验表明,该算法在清洗后模型分类准确率方面往往可以明显超过现有的针对机器学习模型的自动清洗算法,而且相比同样需要人工参与的数据清洗算法,该方法具有更高的清洗效率,可以有效节约人工成本,同时保证理想的模型准确度。

关键词: 不完备数据集, 夏普利值, K-近邻(KNN), 清洗优先级, 数据清洗

Abstract: Real-world datasets are often collected with missing data, and in order to build effective machine learning models on incomplete datasets, the datasets need to be cleaned. To ensure the quality of the cleaned datasets, human involvement is often required, which incurs considerable costs. Prioritizing the cleaning of incomplete data will help minimize cleaning scale and save labor costs. Calculating the priority needs determining the contribution of the incomplete data to the performance of the model. Shapley value is a popular method for evaluating the contribution, so it can be used to calculate the cleaning priority. Due to the lack of existing work on Shapley value of incomplete data, a representation of Shapley value of incomplete data is firstly defined based on the possible worlds of the datasets. And an approximation algorithm for calculating Shapley value of incomplete data in the K-nearest neighbor classification model in polynomial time is proposed based on the K-nearest neighbor utility. Finally, the ShapClean, a heuristic data cleaning algorithm based on Shapley value, is proposed. Experiments show that the algorithm can often significantly exceed the existing automatic cleaning algorithms in terms of the accuracy. And compared with data cleaning algorithms that also require human involvement, the ShapClean can save more labour costs while ensuring the desired model accuracy.

Key words: incomplete dataset, Shapley value, K-nearest neighbor (KNN), cleaning priority, data cleaning