Journal of Frontiers of Computer Science and Technology, 2025, Vol. 19, Issue (9): 2548-2558. DOI: 10.3778/j.issn.1673-9418.2411014

• Big Data Technology •

Gain-Aware Cache Replacement Strategy for RDD Data Blocks on Spark

HE Sha (贺莎), TANG Xiaoyong (唐小勇)

  1. School of Computer and Communications Engineering, Changsha University of Science & Technology, Changsha 410114, China
  2. Yuelushan Laboratory, Changsha 410128, China
  • Online: 2025-09-01 Published: 2025-09-01

Abstract: Cache replacement is an active and challenging research topic in Spark memory optimization. The diversity of application characteristics, limited memory resources, and the uncertainty of cache replacement make high execution performance difficult to achieve, and an inefficient cache replacement strategy can cause performance problems such as long application execution times and low resource utilization. To address this, a gain-aware cache replacement strategy for resilient distributed dataset (RDD) data blocks in the Spark big data processing framework is proposed. The strategy establishes a cache value assessment model that jointly considers four factors of a data block (partition size, reference count, computation cost, and resource cost) to assess its cache value accurately. A cache gain problem model is proposed to formalize the optimization problem of cache management. A gain-aware cache replacement algorithm for RDD data blocks (CRCA) is then proposed to maximize the cache gain of the RDD data blocks held in memory. To validate CRCA, a real big data cluster experimental platform based on Spark is built, and diverse workloads from the HiBench benchmark suite are used for evaluation. The results show that CRCA outperforms the existing least recently used (LRU) and least partition weight (LPW) algorithms in task execution time and CPU utilization.

Key words: RDD data blocks, cache gain, cache replacement, Spark framework
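
The abstract names four factors that feed the cache value of a data block (partition size, reference count, computation cost, resource cost) but does not give the formula or the steps of CRCA. The Python sketch below only illustrates the general idea of value-based eviction. The `Block` fields, the `cache_value` formula, the greedy `choose_victims` loop, and the sample block IDs and numbers are all assumptions made for this illustration; they are not the paper's model or algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A cached RDD data block described by the four factors in the abstract."""
    block_id: str
    size_mb: float        # partition size
    ref_count: int        # how many later uses still reference the block
    compute_cost: float   # cost of recomputing the block if it is evicted
    resource_cost: float  # resources consumed to produce the block


def cache_value(b: Block) -> float:
    # ASSUMPTION: illustrative formula only, not the paper's value model.
    # It rates a block by the recomputation effort it saves per MB it occupies.
    return b.ref_count * (b.compute_cost + b.resource_cost) / b.size_mb


def choose_victims(cached: list[Block], needed_mb: float) -> list[Block]:
    """Greedily evict the lowest-value blocks until needed_mb is freed."""
    victims, freed = [], 0.0
    for b in sorted(cached, key=cache_value):
        if freed >= needed_mb:
            break
        victims.append(b)
        freed += b.size_mb
    return victims


# Hypothetical sample data.
cached = [
    Block("rdd_1_0", size_mb=64, ref_count=3, compute_cost=8.0, resource_cost=2.0),
    Block("rdd_2_0", size_mb=128, ref_count=1, compute_cost=4.0, resource_cost=1.0),
    Block("rdd_3_0", size_mb=32, ref_count=0, compute_cost=9.0, resource_cost=3.0),
]
victims = choose_victims(cached, needed_mb=100)
print([b.block_id for b in victims])  # ['rdd_3_0', 'rdd_2_0']
```

Under these assumed numbers, the block with no remaining references is evicted first even though it is the most expensive to recompute, and the frequently referenced block stays in memory. A recency-only policy such as LRU does not use these factors when choosing what to evict.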