Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (1): 109-118. DOI: 10.3778/j.issn.1673-9418.2003048

• Network and Information Security •

Differentially Private Mixed Data Release Algorithm Based on k-prototype Clustering

QU Jingjing, CAI Ying, FAN Yanfang, XIA Hongke

  1. College of Computer, Beijing Information Science and Technology University, Beijing 100101, China
  • Online: 2021-01-01 Published: 2021-01-07

Abstract:

Differential privacy is a model that provides strong privacy protection. Under the non-interactive framework, data curators can publish datasets processed with differential privacy techniques for researchers to mine and analyze. However, the data release process requires adding a large amount of noise, which degrades data utility. Therefore, a differentially private mixed data release algorithm based on k-prototype clustering is proposed. First, the k-prototype clustering algorithm is improved: different attribute dissimilarity measures are used for numerical and categorical attributes, and the records in the mixed dataset that are more likely to be related are grouped together, thereby reducing the differential privacy sensitivity. Second, combined with the cluster center values, differential privacy techniques are applied to protect the data records, using the Laplace mechanism for numerical attributes and the exponential mechanism for categorical attributes. Finally, the privacy of the algorithm is analyzed and proved from two aspects: the definition of differential privacy and its composition properties. Experimental results show that the algorithm can effectively improve data utility.

Key words: differential privacy, mixed datasets, k-prototype, clustering, data release
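
The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dissimilarity below is the standard k-prototypes form (squared Euclidean distance plus a weighted count of categorical mismatches), not the improved measure proposed in the paper, and the function names, the 0/1 utility function, the equal split of the privacy budget across attributes, and the per-attribute sensitivities are all assumptions made for illustration.

```python
import math
import random


def mixed_dissimilarity(record, center, num_idx, cat_idx, gamma=1.0):
    """Standard k-prototypes dissimilarity (assumed form, not the paper's variant)."""
    numeric = sum((record[i] - center[i]) ** 2 for i in num_idx)
    mismatches = sum(1 for i in cat_idx if record[i] != center[i])
    return numeric + gamma * mismatches


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Numerical attribute: add Laplace noise with scale sensitivity / epsilon."""
    return value + laplace_noise(sensitivity / epsilon, rng)


def exponential_weights(candidates, utility, epsilon, sensitivity=1.0):
    """Selection probabilities proportional to exp(eps * u(c) / (2 * sensitivity))."""
    scores = [epsilon * utility(c) / (2.0 * sensitivity) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(candidates, utility, epsilon, rng, sensitivity=1.0):
    """Categorical attribute: sample one candidate from the exponential weights."""
    probs = exponential_weights(candidates, utility, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]


def perturb_center(center, num_idx, cat_idx, domains, sensitivities, epsilon, rng):
    """Perturb a cluster center attribute by attribute.

    Assumption: the budget epsilon is split equally across attributes
    (sequential composition); the paper's allocation may differ.
    """
    eps_each = epsilon / len(center)
    released = list(center)
    for i in num_idx:
        released[i] = laplace_mechanism(center[i], sensitivities[i], eps_each, rng)
    for i in cat_idx:
        # Hypothetical utility: 1 for the cluster's own category, 0 otherwise.
        utility = lambda c, mode=center[i]: 1.0 if c == mode else 0.0
        released[i] = exponential_mechanism(domains[i], utility, eps_each, rng)
    return released
```

For example, `perturb_center([30.0, "x"], [0], [1], {1: ["x", "y", "z"]}, {0: 1.0}, 1.0, random.Random(0))` returns a noisy numerical value and a category drawn from the given domain. Grouping similar records first is what keeps the assumed per-cluster sensitivity, and therefore the Laplace noise scale, small.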