Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2025, Vol. 19, Issue 9: 2532-2547. DOI: 10.3778/j.issn.1673-9418.2412075

• Big Data Technology •

Efficient Data Trading Valuation Framework with Replication Robustness

CHEN Siyuan (陈思远), CHEN Chen (陈辰), YUAN Ye (袁野), LI Boyang (李博扬)

  1. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
  2. Tangshan Research Institute, Beijing Institute of Technology, Tangshan, Hebei 063000, China
  • Online: 2025-09-01 Published: 2025-09-01

Abstract: With the rise of data trading markets, data valuation has become a key technical problem. The data Shapley value is a fair measure of data value, but its high computational cost and its vulnerability to data replication attacks severely limit its use in real-world data trading. This paper proposes an efficient, replication-robust valuation framework for data trading. To address the computational cost, it optimizes how value estimates are updated after each utility evaluation of a data subset and introduces OA-Shapley (one for all Shapley), an efficient approximation algorithm for the data Shapley value. OA-Shapley updates the Shapley value estimates of all data points from a single utility evaluation, which significantly improves computational efficiency, and it comes with theoretical guarantees on unbiasedness and mean squared error. To address replication attacks, the paper proves that strict redundancy is a sufficient condition for replication robustness and, on that basis, proposes the CL+Shapley (Cluster+Shapley) valuation framework. The framework achieves strict redundancy through clustering as a preprocessing step, which effectively defends against replication attacks; because it is decoupled from any specific data Shapley algorithm, it is widely applicable. In experiments that remove high-value (low-value) data points, OA-Shapley outperforms the baseline algorithms by 12.4% (3.5%) in AUC and detects 9% to 32% more invalid data. In the replication attack experiments, CL+Shapley shows excellent replication robustness.

Key words: data trading, data market, data Shapley value, replication robustness, clustering algorithm
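
To make the replication problem in the abstract concrete, the following is a minimal sketch, not the paper's OA-Shapley or CL+Shapley implementation. It computes exact Shapley values by brute force on a toy market and shows how a seller who submits a copy of their own record gains share, and how grouping identical records before valuation removes that gain. Everything here is an assumption made for illustration: the toy utility (1.0 if any record is present), the player and owner names, grouping by exact equality as a stand-in for the clustering step, and splitting a cluster's value evenly among its members.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, utility):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Probability weight of a coalition of size k preceding player p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


def toy_utility(coalition):
    """Hypothetical utility: any single record is enough to reach full value."""
    return 1.0 if coalition else 0.0


def owner_totals(values, owner_of):
    """Sum per-record values into per-owner payouts."""
    totals = {}
    for player, value in values.items():
        owner = owner_of[player]
        totals[owner] = totals.get(owner, 0.0) + value
    return totals


def clustered_owner_totals(record_of, owner_of, utility):
    """Value clusters of identical records, then split each cluster's value evenly.

    Exact-equality grouping is an illustrative stand-in for a clustering step.
    """
    clusters = {}
    for player, record in record_of.items():
        clusters.setdefault(record, []).append(player)
    cluster_values = exact_shapley(list(clusters), utility)
    player_values = {}
    for record, members in clusters.items():
        for member in members:
            player_values[member] = cluster_values[record] / len(members)
    return owner_totals(player_values, owner_of)


owner_of = {"a1": "A", "a2": "A", "b1": "B"}
honest_records = {"a1": "r1", "b1": "r2"}
attack_records = {"a1": "r1", "a2": "r1", "b1": "r2"}  # A submits r1 twice

honest = owner_totals(exact_shapley(list(honest_records), toy_utility), owner_of)
attacked = owner_totals(exact_shapley(list(attack_records), toy_utility), owner_of)
defended = clustered_owner_totals(attack_records, owner_of, toy_utility)

print("honest:  ", honest)
print("attacked:", attacked)
print("defended:", defended)
```

In this toy market, owner A's share rises from 1/2 to 2/3 after copying the record, and returns to 1/2 once identical records are valued as a single cluster. The brute-force enumeration is only feasible for a handful of players, which is the cost that approximation algorithms for the data Shapley value aim to avoid.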