Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (10): 2551-2572. DOI: 10.3778/j.issn.1673-9418.2312034
HU Cheng, CHEN Shihong
Published:
2024-10-01
Online:
2024-09-29
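Abstract: In distributed service resource environments, peak load makes up only a small share of the overall load, so a large portion of service resources remains at low utilization, or even idle, for long periods. Adaptive elastic scaling of resources, which expands service resources under high load to meet high demand and shrinks them under low load to cut overhead, can significantly improve system energy efficiency and reduce operating costs. Real workloads, however, usually fluctuate strongly, and the service resources needed to satisfy quality of service change continuously, which poses a major challenge to adaptive elastic scaling. Although existing commercial distributed platforms generally offer some elastic scaling capability, their adaptivity is limited and their precision is poor, leaving considerable room for improvement. To promote research and application in this field, this paper classifies, analyzes, and discusses studies on adaptive elastic scaling of service resources in this environment. First, it introduces the research background and the challenges, which lie mainly in demand evaluation and resource adjustment. Second, it divides related work at home and abroad into three categories according to the resource object being adjusted, discusses each category, compares the similarities and differences among the studies, and analyzes and summarizes their respective characteristics and effectiveness. Finally, it reviews these studies as a whole and distills from them a comprehensive, holistic implementation, and it discusses the current state of industrial application, the challenges facing research, and future trends.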
HU Cheng, CHEN Shihong. Survey of Adaptive Elastic Scaling Studies on Distributed Service Resources[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(10): 2551-2572.