Journal of Frontiers of Computer Science and Technology

• Academic Research •

A Gain-aware Cache Replacement Strategy for RDD Data Blocks on Spark

HE Sha, TANG Xiaoyong

  1. School of Computer and Communications Engineering, Changsha University of Science & Technology, Changsha 410114, China
  2. Yuelushan Laboratory, Changsha 410128, China

Abstract: Cache replacement is an active and challenging research topic in Spark memory optimization. The diversity of application characteristics, the limited memory resources, and the uncertainty of cache replacement make high system execution performance difficult to achieve, and an inefficient cache replacement strategy can cause performance problems such as long application execution times and low resource utilization. To address this, this paper proposes a gain-aware cache replacement strategy for Resilient Distributed Dataset (RDD) data blocks on the Spark big data processing framework. The strategy first establishes a cache value evaluation model that jointly considers four impact factors of a data block (partition size, reference count, computation cost, and resource cost) in order to assess its cache value accurately. It then formulates a cache gain problem model that formally describes the optimization problem of cache management. Finally, it presents a gain-aware cache replacement algorithm for RDD data blocks (CRCA) that aims to maximize the cache gain of the RDD data blocks held in memory. To validate CRCA, we built a real big data cluster experimental platform based on Spark and evaluated the algorithm with diverse workloads from the HiBench benchmark suite. The results show that CRCA outperforms the existing least recently used (LRU) and least partition weight (LPW) algorithms in terms of task execution time and CPU utilization.

Keywords: RDD data blocks, cache gain, cache replacement, Spark framework
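
The abstract names the four factors behind a block's cache value and the goal of maximizing cache gain, but it does not give the formulas or the steps of CRCA. The Python sketch below is therefore only an illustration of the general idea, not the authors' algorithm. The value function (reference count times the sum of computation and resource cost), the value-per-MB ranking, the greedy eviction rule, and all names (`Block`, `cache_value`, `value_density`, `admit`) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """An RDD data block described by the four factors named in the abstract."""
    name: str
    size_mb: float        # partition size
    ref_count: int        # remaining references to the block
    compute_cost: float   # cost to recompute the block if evicted (assumed unit)
    resource_cost: float  # extra resource cost of recomputation (assumed unit)


def cache_value(b: Block) -> float:
    # Hypothetical value model: cost avoided over all remaining references.
    return b.ref_count * (b.compute_cost + b.resource_cost)


def value_density(b: Block) -> float:
    # Value per MB of memory occupied; used to rank eviction candidates.
    return cache_value(b) / b.size_mb


def admit(cache: dict, capacity_mb: float, new: Block):
    """Try to cache `new`. Returns (admitted, names_of_evicted_blocks).

    Greedy heuristic (an assumption, not CRCA itself): evict the blocks with
    the lowest value density first, and only go ahead if the value of the
    evicted blocks does not exceed the value of the new block.
    """
    if new.size_mb > capacity_mb:
        return (False, [])
    free = capacity_mb - sum(b.size_mb for b in cache.values())
    victims = []
    for b in sorted(cache.values(), key=value_density):
        if free >= new.size_mb:
            break
        victims.append(b)
        free += b.size_mb
    if sum(cache_value(v) for v in victims) > cache_value(new):
        return (False, [])  # replacement would lower the total gain
    for v in victims:
        del cache[v.name]
    cache[new.name] = new
    return (True, [v.name for v in victims])


if __name__ == "__main__":
    cache = {}
    for blk in [
        Block("rdd_1_0", 60, 1, 2.0, 1.0),
        Block("rdd_2_0", 40, 3, 4.0, 1.0),
        Block("rdd_3_0", 50, 2, 5.0, 1.0),
        Block("rdd_4_0", 30, 1, 1.0, 0.0),
    ]:
        print(blk.name, admit(cache, 100.0, blk))
```

Unlike LRU, which evicts by recency alone, this kind of rule can refuse to evict a block that is expensive to recompute and still referenced, which is the intuition behind a gain-aware policy.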