Gain-Aware Cache Replacement Strategy for RDD Data Blocks on Spark

doi:10.3778/j.issn.1673-9418.2411014

Abstract

Abstract: Cache replacement is an active and difficult research topic in Spark memory optimization. The diversity of application characteristics, limited memory resources, and the uncertainty of cache replacement make it hard to achieve high execution performance, and inefficient cache replacement strategies can cause long application execution times and low resource utilization. To address these problems, a gain-aware cache replacement strategy for resilient distributed dataset (RDD) data blocks on the Spark big data processing framework is proposed. The strategy establishes a cache value assessment model that jointly considers four impact factors of a data block (partition size, reference count, computation cost, and resource cost) in order to assess its cache value accurately. A cache gain problem model is proposed to formalize the optimization problem of cache management. A gain-aware cache replacement algorithm for RDD data blocks (CRCA) is then proposed to maximize the cache gain delivered by the RDD data blocks held in memory. To validate CRCA, a real big data cluster platform is built on Spark and evaluated with diverse workloads from the HiBench benchmark suite. The results show that the proposed algorithm outperforms the existing least recently used (LRU) and least partition weight (LPW) algorithms in terms of task execution time and CPU utilization.

Key words: RDD data blocks, cache gain, cache replacement, Spark framework

HE Sha, TANG Xiaoyong. Gain-Aware Cache Replacement Strategy for RDD Data Blocks on Spark[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(9): 2548-2558.
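
The abstract names four factors that feed the cache value of a data block (partition size, reference count, computation cost, and resource cost) but does not give the formula or the steps of CRCA. The following is therefore only a minimal sketch of a value-based eviction decision, not the authors' model or algorithm. The scoring function `cache_value`, the greedy selection in `choose_victims`, and all field names and units are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Block:
    """Hypothetical description of one cached RDD data block."""
    block_id: str
    size_mb: float        # partition size, assumed > 0
    ref_count: int        # references still expected from later stages
    compute_cost: float   # cost to recompute the block if evicted
    resource_cost: float  # extra resource cost of recomputation


def cache_value(b: Block) -> float:
    # Assumed scoring: recomputation avoided per MB of memory occupied.
    # The paper's actual model is not stated in the abstract.
    return b.ref_count * (b.compute_cost + b.resource_cost) / b.size_mb


def choose_victims(cached: List[Block], incoming: Block,
                   capacity_mb: float) -> Optional[List[Block]]:
    """Greedy heuristic: evict the lowest-value blocks until `incoming` fits.

    Returns the blocks to evict (possibly empty), or None when caching
    `incoming` would require evicting a block worth at least as much.
    """
    used = sum(b.size_mb for b in cached)
    needed = used + incoming.size_mb - capacity_mb
    if needed <= 0:
        return []  # enough free memory, nothing to evict

    victims: List[Block] = []
    freed = 0.0
    for b in sorted(cached, key=cache_value):
        if freed >= needed:
            break
        if cache_value(b) >= cache_value(incoming):
            return None  # eviction would not increase the total value
        victims.append(b)
        freed += b.size_mb
    return victims if freed >= needed else None


if __name__ == "__main__":
    cached = [
        Block("a", 100, 1, 2.0, 1.0),
        Block("b", 100, 5, 10.0, 2.0),
        Block("c", 50, 0, 5.0, 1.0),
    ]
    incoming = Block("d", 80, 3, 8.0, 1.0)
    result = choose_victims(cached, incoming, capacity_mb=250)
    print([b.block_id for b in result] if result is not None else None)
```

Dividing by size makes the score a value density, so a large block with few remaining references is evicted before a small, frequently reused one. LRU, by contrast, would look only at recency of access.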
