Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (4): 822-834. DOI: 10.3778/j.issn.1673-9418.2009098
CHEN Shenglei1, QIU Yitao1, JIANG Congfeng1,+, ZHANG Jilin1, YU Jun1, LIN Jiangbin2, YAN Longchuan3, REN Zujie4, WAN Jian5
Received: 2020-09-07
Revised: 2020-11-02
Online: 2022-04-01
Published: 2020-11-06
Corresponding author: + E-mail: cjiang@hdu.edu.cn
About author: CHEN Shenglei, born in 1996 in Jinyun, Zhejiang, M.S. candidate. Her research interest is cloud computing.
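Abstract:
To raise resource utilization in cloud data centers while reducing cost and energy consumption, many cloud data centers now co-locate online services and offline tasks. Co-location brings many benefits, but it makes task scheduling more complex and poses a series of challenges for guaranteeing highly reliable, low-latency services. This paper analyzes in depth the running status of all online services and offline tasks over 8 days in an Alibaba data center cluster of 4 034 servers. The analysis leads to the following conclusions. First, for online services, the average CPU utilization of all containers varies periodically: it stays at a relatively high level from 8 a.m. to 9 p.m. every day and falls to its lowest point at 4 a.m. every day. Second, for offline tasks, excluding the first and eighth days, the task submission peaks of the remaining 6 days fall at the same time each day. 95% of instances run for no more than 199 s, but 0.052% of instances run for more than 1 h, and some last for several days. Third, for applications, the number of containers deployed per application differs widely: an application uses as many as 629 containers and as few as 1. Finally, cluster analysis of servers, online tasks and batch instances shows that containers with relatively high resource utilization account for the vast majority of all containers, while instances with low resource utilization and short execution time account for the vast majority of all instances. The findings and recommendations help data center managers understand workload characteristics in more detail, and thereby improve data center resource utilization and the fault tolerance of tasks.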
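The run-time figures in the abstract (a 95th percentile and the share of instances above 1 h) can be reproduced from a list of instance durations. The following is a minimal sketch, assuming durations are already available in seconds; the function names, the nearest-rank percentile definition, and the sample values are illustrative assumptions, not the authors' method or data.

```python
import math


def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Nearest rank: smallest rank whose cumulative share is at least pct percent.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def share_longer_than(values, threshold_s):
    """Return the fraction of values strictly greater than threshold_s."""
    if not values:
        raise ValueError("values must be non-empty")
    return sum(1 for v in values if v > threshold_s) / len(values)


# Hypothetical instance durations in seconds (not taken from the trace).
durations_s = [3, 5, 8, 12, 20, 35, 60, 120, 199, 7200]
p95 = nearest_rank_percentile(durations_s, 95)
long_share = share_longer_than(durations_s, 3600)
```

On the full trace the same two calls would be applied to the duration of every batch instance; with billions of rows a streaming or sampled computation would be needed instead of an in-memory sort.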
CHEN Shenglei, QIU Yitao, JIANG Congfeng, ZHANG Jilin, YU Jun, LIN Jiangbin, YAN Longchuan, REN Zujie, WAN Jian. Workload Characterization of Online and Offline Services in Co-located Data Centers[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(4): 822-834.
Table 1 Number of records in the Alibaba dataset
Table name | Records | File size |
---|---|---|
machine meta | 17 592 | 0.56 MB |
machine usage | 246 637 252 | 8.38 GB |
container meta | 370 540 | 22.26 MB |
container usage | 4 015 763 787 | 164.09 GB |
batch task | 14 295 731 | 0.98 GB |
batch instance | 1 351 255 775 | 103.69 GB |
Table 2 Number of containers with only one state
State | started | stopped | allocated | unknow | Total |
---|---|---|---|---|---|
Number of containers | 70 903 | 400 | 39 | 0 | 71 342 |
Table 3 Number of containers with two states
States | started,stopped | started,allocated | started,unknow | Total |
---|---|---|---|---|
Number of containers | 118 | 10 | 5 | 133 |
Table 4 Number of containers with three states
States | started,allocated,stopped |
---|---|
Number of containers | 1 |
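Tables 2 to 4 group containers by the set of distinct states each container shows during the trace. A minimal sketch of that grouping follows, assuming the metadata is available as (container ID, status) pairs; the function name, the record layout, and the sample IDs are hypothetical.

```python
from collections import Counter, defaultdict


def count_state_combinations(records):
    """Count containers per distinct set of observed states.

    records: iterable of (container_id, status) pairs (assumed layout).
    """
    states_by_container = defaultdict(set)
    for container_id, status in records:
        states_by_container[container_id].add(status)
    # frozenset makes each state combination hashable and order-independent.
    return Counter(frozenset(states) for states in states_by_container.values())


# Hypothetical records, not taken from the trace.
records = [
    ("c_1", "started"), ("c_1", "started"),
    ("c_2", "started"), ("c_2", "stopped"),
    ("c_3", "allocated"),
    ("c_4", "started"), ("c_4", "allocated"), ("c_4", "stopped"),
]
combos = count_state_combinations(records)
```

Combinations of size one, two and three then correspond to the columns of Tables 2, 3 and 4 respectively.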
Table 5 Fitting functions and parameter values of the container resource usage distributions
Fitting function | Resource category | Parameter 1 | Parameter 2 | Parameter 3 | R-square |
---|---|---|---|---|---|
| CPU | 5 792 | -0.086 | - | 0.977 |
| memory | 0.003 | 0.146 | - | 0.617 |
| disk_io | 14 430 | 8.203 | 2.452 | 0.992 |
Table 6 Boundaries of feature vectors for servers
Feature vector group name | Average CPU | Average memory | Average disk |
---|---|---|---|
mGroup0 | 35.775~44.424 | 81.836~92.608 | 3.298~32.604 |
mGroup1 | 0.000 2~28.393 | 2.999~48.931 | 0~25.944 |
mGroup2 | 2.257~21.418 | 52.162~96.156 | 0.611~24.015 |
mGroup3 | 39.908~60.559 | 81.562~92.383 | 2.940~28.120 |
mGroup4 | 34.719~58.793 | 81.577~92.079 | 40.815~98.108 |
mGroup5 | 20.179~37.649 | 49.439~94.623 | 2.737~20.349 |
Table 7 Boundaries of feature vectors for containers
Feature vector group name | Average CPU | Average memory | Average disk |
---|---|---|---|
cGroup0 | 0.000 1~99.989 | 64.208 5~100.000 | 0.628~98.897 |
cGroup1 | 0~99.999 8 | 1.008~69.847 | 0~98.910 2 |
Table 8 Boundaries of feature vectors for sampled instances
Feature vector group name | Average CPU | Average memory | Duration/s |
---|---|---|---|
iGroup0 | 0~4 257 | 0~11.130 0 | 271~221 229 |
iGroup1 | 0~2 135 | 0~91.599 9 | 1~296 |
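Tables 6 to 8 report, for each cluster, the minimum and maximum of every feature. Once each sample carries a cluster label, such boundaries can be derived in a single pass. The sketch below assumes labelled samples given as (group, feature vector) pairs; the function name and sample values are hypothetical, and the clustering step itself is not shown.

```python
def feature_boundaries(samples):
    """Return {group: [(min, max), ...]} with one (min, max) pair per feature.

    samples: iterable of (group_name, feature_vector) pairs (assumed layout).
    """
    bounds = {}
    for group, vector in samples:
        if group not in bounds:
            # First sample of a group initialises every boundary to its own value.
            bounds[group] = [(x, x) for x in vector]
        else:
            bounds[group] = [
                (min(lo, x), max(hi, x))
                for (lo, hi), x in zip(bounds[group], vector)
            ]
    return bounds


# Hypothetical labelled samples: (average CPU, average memory, average disk).
samples = [
    ("g0", (40.0, 85.0, 10.0)),
    ("g0", (36.0, 90.0, 30.0)),
    ("g1", (5.0, 20.0, 0.0)),
]
bounds = feature_boundaries(samples)
```

Each (min, max) pair corresponds to one "low~high" cell in the tables above.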
Table 9 Proportion of two types of containers and instances in each type of server
Feature vector group name | cGroup0 | cGroup1 | iGroup0 | iGroup1 |
---|---|---|---|---|
mGroup0 | 0.74 | 0.26 | 0.03 | 0.97 |
mGroup1 | 0.75 | 0.25 | 0.03 | 0.97 |
mGroup2 | 0.79 | 0.21 | 0.03 | 0.97 |
mGroup3 | 0.81 | 0.19 | 0.04 | 0.96 |
mGroup4 | 0.78 | 0.22 | 0.04 | 0.96 |
mGroup5 | 0.67 | 0.33 | 0.03 | 0.97 |
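Each row of Table 9 is a pair of row-normalised proportions: within one server group, the container shares sum to 1 and the instance shares sum to 1. A minimal sketch of that cross-tabulation follows, assuming each container (or instance) has already been mapped to the group of the server hosting it; the function name and sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def group_proportions(pairs):
    """Return {server_group: {workload_group: share}} with shares summing to 1.

    pairs: iterable of (server_group, workload_group), one per container
    or instance (assumed layout).
    """
    counts = defaultdict(Counter)
    for server_group, workload_group in pairs:
        counts[server_group][workload_group] += 1
    return {
        server_group: {
            workload_group: n / sum(counter.values())
            for workload_group, n in counter.items()
        }
        for server_group, counter in counts.items()
    }


# Hypothetical placement pairs, not taken from the trace.
pairs = (
    [("mGroup0", "cGroup0")] * 3
    + [("mGroup0", "cGroup1")]
    + [("mGroup1", "cGroup0")]
)
props = group_proportions(pairs)
```

Running the function once on container placements and once on instance placements would give the two halves of each table row.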