Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models

doi:10.3778/j.issn.1673-9418.2207105

Abstract

Datasets collected in the real world often contain missing values, so they must be cleaned before an effective machine learning model can be built on them. Good cleaning quality usually requires human involvement, which incurs considerable cost. Determining a cleaning priority for incomplete data helps reduce the amount of data that must be cleaned and thus saves labor. Computing that priority requires knowing how much each incomplete data point contributes to model performance. The Shapley value is a popular method for evaluating the contribution of data to a machine learning model, so it can be used to compute the cleaning priority of incomplete data. Because existing work has not studied the Shapley value of incomplete data, a representation of the Shapley value of incomplete data is first defined over the exponentially many possible worlds of an incomplete dataset. Then, based on the utility function of the K-nearest neighbor (KNN) classification model, a polynomial-time approximation algorithm is proposed for computing the Shapley value of incomplete data in that model. Finally, ShapClean, a Shapley-value-based heuristic data cleaning algorithm for KNN classification models, is proposed. Experiments show that, in terms of the classification accuracy of the model after cleaning, ShapClean often clearly outperforms existing automatic cleaning algorithms designed for machine learning models. Compared with data cleaning algorithms that also require human involvement, it cleans more efficiently, saving labor cost while maintaining the desired model accuracy.

Key words: incomplete dataset, Shapley value, K-nearest neighbor (KNN), cleaning priority, data cleaning
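To make the notion of a per-point Shapley value under a KNN utility concrete, the following Python sketch covers only the complete-data case. It is not the approximation algorithm for incomplete data summarized in the abstract, nor ShapClean. It implements the exact recursion known from prior work on KNN data valuation (Jia et al., PVLDB 2019) for an unweighted KNN classifier and a single test point. The function name `knn_shapley_single_test`, its parameters, and the utility v(S) = (1/K) · (number of the min(K, |S|) nearest points in S whose label equals the test label) are assumptions chosen for illustration.

```python
from typing import List, Sequence


def knn_shapley_single_test(
    distances: Sequence[float],
    labels: Sequence[int],
    test_label: int,
    k: int,
) -> List[float]:
    """Exact Shapley value of each (complete) training point for one test point.

    Assumed utility of a subset S of training points:
        v(S) = (1/k) * sum of 1[label == test_label]
               over the min(k, |S|) points of S nearest to the test point.
    distances[i] is the distance from training point i to the test point.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(distances)
    if n != len(labels):
        raise ValueError("distances and labels must have the same length")
    if n == 0:
        return []

    # Rank training points from nearest to farthest (ties keep input order).
    order = sorted(range(n), key=lambda i: distances[i])
    match = [1.0 if labels[i] == test_label else 0.0 for i in order]

    # Recursion runs from the farthest point back to the nearest one.
    by_rank = [0.0] * n
    by_rank[n - 1] = match[n - 1] / n
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of the point at position pos
        by_rank[pos] = by_rank[pos + 1] + (
            (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )

    # Map values back to the original training-point order.
    values = [0.0] * n
    for pos, idx in enumerate(order):
        values[idx] = by_rank[pos]
    return values


if __name__ == "__main__":
    vals = knn_shapley_single_test([0.9, 0.1, 0.5], [1, 1, 0], test_label=1, k=1)
    # Points with larger values contribute more to classifying this test point.
    ranking = sorted(range(len(vals)), key=lambda i: vals[i], reverse=True)
    print(vals, ranking)
```

For a whole test set, one common choice is to average these per-test-point values. How such values are extended to tuples with missing attributes, and how they are turned into a cleaning order, is the subject of the paper itself and is not reproduced here.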

WANG Jingyi, CHEN Yinjia, YUAN Ye, CHEN Chen, WANG Guoren. Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(9): 2241-2251.
