Differentially Private Mixed Data Release Algorithm Based on k-prototype Clustering

doi:10.3778/j.issn.1673-9418.2003048

Abstract


Differential privacy is a model that provides strong privacy guarantees. In the non-interactive framework, a data curator can publish datasets processed with differential privacy techniques so that researchers can mine and analyze them. However, the release process requires adding a large amount of noise, which degrades data utility. To address this, a differentially private mixed data release algorithm based on k-prototype clustering is proposed. First, the k-prototype clustering algorithm is improved: different attribute dissimilarity measures are used for numerical attributes and categorical attributes, and the records in the mixed dataset that are more likely to be related are grouped together, which reduces the differential privacy sensitivity. Then, using the cluster center values, differential privacy is applied to protect the data records, with the Laplace mechanism used for numerical attributes and the exponential mechanism used for categorical attributes. The privacy of the algorithm is analyzed and proved from both the definition of differential privacy and its composition properties. Experimental results show that the algorithm effectively improves data utility.

Key words: differential privacy, mixed datasets, k-prototype, clustering, data release
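
The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dissimilarity shown is the standard k-prototype form (squared Euclidean distance on numerical attributes plus a weighted mismatch count on categorical attributes), not the improved measure the paper proposes. The weight `gamma`, the `utility` function, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def mixed_dissimilarity(record, center, numeric_idx, categorical_idx, gamma=1.0):
    """Standard k-prototype dissimilarity between a record and a cluster center.

    Numerical attributes contribute squared differences; categorical
    attributes contribute gamma for every mismatch (gamma is an assumed weight).
    """
    numeric_part = sum((record[i] - center[i]) ** 2 for i in numeric_idx)
    mismatches = sum(1 for i in categorical_idx if record[i] != center[i])
    return numeric_part + gamma * mismatches


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Add Laplace(0, sensitivity / epsilon) noise to a numerical value.

    The difference of two i.i.d. exponential samples with mean `scale`
    follows a Laplace distribution with that scale.
    """
    scale = sensitivity / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return value + noise


def exponential_mechanism(candidates, utility, sensitivity, epsilon, rng):
    """Pick a categorical value with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    scores = [utility(c) for c in candidates]
    best = max(scores)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(epsilon * (s - best) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    record = (30.0, "teacher")
    center = (32.0, "engineer")
    print(mixed_dissimilarity(record, center, [0], [1], gamma=0.5))
    # Perturb the numerical part of the center with the Laplace mechanism.
    print(laplace_mechanism(center[0], sensitivity=1.0, epsilon=0.5, rng=rng))
    # Choose a categorical value, scoring the center's own value highest.
    jobs = ["teacher", "engineer", "nurse"]
    print(exponential_mechanism(jobs, lambda j: 1.0 if j == center[1] else 0.0,
                                sensitivity=1.0, epsilon=0.5, rng=rng))
```

In this sketch a smaller sensitivity yields a smaller Laplace scale for the same epsilon, which is the reason the abstract gives for grouping related records before adding noise.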


QU Jingjing, CAI Ying, FAN Yanfang, XIA Hongke. Differentially Private Mixed Data Release Algorithm Based on k-prototype Clustering[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(1): 109-118.

