Journal of Frontiers of Computer Science and Technology, 2022, Vol. 16, Issue (4): 822-834. DOI: 10.3778/j.issn.1673-9418.2009098

• Database Technology •

Workload Characterization of Online and Offline Services in Co-located Data Centers

CHEN Shenglei1, QIU Yitao1, JIANG Congfeng1,+, ZHANG Jilin1, YU Jun1, LIN Jiangbin2, YAN Longchuan3, REN Zujie4, WAN Jian5

  1. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
    2. AliCloud Computing Co., Ltd., Hangzhou 311121, China
    3. State Grid Electrical Information Communication Co., Ltd., Beijing 100053, China
    4. Zhejiang Lab, Hangzhou 311121, China
    5. School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
  • Received: 2020-09-07 Revised: 2020-11-02 Online: 2022-04-01 Published: 2020-11-06
  • About author: CHEN Shenglei, born in 1996, M.S. candidate. Her research interest is cloud computing.
    QIU Yitao, born in 1995, M.S. candidate. His research interest is cloud computing.
    JIANG Congfeng, born in 1980, Ph.D., professor, Ph.D. supervisor. His research interest is cloud computing.
    ZHANG Jilin, born in 1980, professor, Ph.D. supervisor. His research interest is high performance computing.
    YU Jun, born in 1980, Ph.D., professor, Ph.D. supervisor. His research interest is machine learning.
    LIN Jiangbin, born in 1983, M.S., software engineer. His research interest is distributed storage systems.
    YAN Longchuan, born in 1978, Ph.D. candidate. His research interests include green computing, cloud computing and deep learning.
    REN Zujie, born in 1984, Ph.D. His research interest is distributed storage systems.
    WAN Jian, born in 1969, Ph.D., professor. His research interests include cloud computing and big data analytics.
  • Supported by:
    National Natural Science Foundation of China (61972118); National Key Research and Development Program of China (2017YFB1010000); Key Research and Development Program of Zhejiang Province (2017C01SA160069)

Chinese title: 混部数据中心在线离线服务特征分析 (Characterization of Online and Offline Services in Co-located Data Centers)

陈圣蕾1, 裘翼滔1, 蒋从锋1,+, 张纪林1, 俞俊1, 林江彬2, 闫龙川3, 任祖杰4, 万健5

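  1. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018
    2. AliCloud Computing Co., Ltd., Hangzhou 311121
    3. State Grid Electrical Information Communication Co., Ltd., Beijing 100053
    4. Zhejiang Lab, Hangzhou 311121
    5. School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023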
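  • Corresponding author: + E-mail: cjiang@hdu.edu.cn
  • Author biographies: CHEN Shenglei (陈圣蕾), born in 1996, female, from Jinyun, Zhejiang, M.S. candidate. Her main research interest is cloud computing.
    QIU Yitao (裘翼滔), born in 1995, male, from Ningbo, Zhejiang, M.S. candidate. His main research interest is cloud computing.
    JIANG Congfeng (蒋从锋), born in 1980, male, from Xiangyang, Hubei, Ph.D., professor, Ph.D. supervisor. His main research interest is cloud computing.
    ZHANG Jilin (张纪林), born in 1980, male, from Jinan, Shandong, professor, Ph.D. supervisor. His main research interest is high performance computing.
    YU Jun (俞俊), born in 1980, male, Ph.D., professor, Ph.D. supervisor. His main research interest is machine learning.
    LIN Jiangbin (林江彬), born in 1983, male, M.S., software engineer. His main research interest is distributed systems.
    YAN Longchuan (闫龙川), born in 1978, male, Ph.D. candidate. His main research interests are green computing, cloud computing and deep learning.
    REN Zujie (任祖杰), born in 1984, male, Ph.D. His main research interest is distributed systems.
    WAN Jian (万健), born in 1969, male, Ph.D., professor. His main research interests are cloud computing and big data analytics.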
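  • Funding:
    National Natural Science Foundation of China (61972118); National Key Research and Development Program of China (2017YFB1010000); Key Research and Development Program of Zhejiang Province (2017C01SA160069)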

Abstract:

To reduce cost and energy consumption while improving resource utilization, many cloud data centers now co-locate online services with offline batch workloads. Although co-location brings many benefits, it complicates task scheduling and makes it harder to guarantee high reliability and low latency for services. This paper analyzes the operation of all online services and offline batch workloads on a cluster of 4 034 servers in an Alibaba data center over a period of 8 days. The analysis yields the following conclusions. First, for online services, the average CPU utilization of all containers changes periodically: it stays at a relatively high level from 8:00 am to 9:00 pm each day and falls to its lowest point at 4:00 am. Second, for offline tasks, except on the first and eighth days, the task-submission peaks on the remaining six days occur at the same time each day. The running time of 95% of the instances is within 199 s, but 0.052% of the instances run for more than one hour, some of them for several days. Third, regarding applications, the number of containers deployed varies widely across applications: a single application uses at most 629 containers and at least 1 container. Finally, cluster analysis is conducted on servers, online tasks and batch instances. Containers with relatively high resource utilization account for the vast majority of all containers, while instances with low resource utilization and short execution time account for the vast majority of all instances. The findings and recommendations in this paper can help data center managers understand the characteristics of co-located workloads in more detail, thereby improving resource utilization and the fault tolerance of each task.

Key words: data center, workload characteristics, online service, offline task, scheduling
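The summary statistics quoted in the abstract (a 95th-percentile running time, the share of instances running longer than one hour, and a per-hour CPU utilization profile) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names, the nearest-rank percentile definition, the assumption that timestamps are seconds counted from a midnight start, and all data values below are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def share_longer_than(values, threshold_s):
    """Fraction of durations strictly longer than threshold_s seconds."""
    return sum(1 for v in values if v > threshold_s) / len(values)


def hourly_mean_cpu(samples):
    """Mean CPU utilization per hour of day from (timestamp_s, cpu_percent) pairs.

    Assumption: timestamp_s counts seconds from a trace start aligned to midnight.
    """
    buckets = defaultdict(list)
    for ts, cpu in samples:
        buckets[(ts // 3600) % 24].append(cpu)
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}


# Synthetic, made-up data (not taken from the trace studied in the paper).
durations = [10] * 90 + [150] * 5 + [199] * 4 + [7200]
cpu_samples = [
    (4 * 3600, 10.0),
    (4 * 3600 + 60, 14.0),
    (10 * 3600, 40.0),
    (34 * 3600, 50.0),  # hour 10 of the second day
]

p95 = percentile(durations, 95)
long_share = share_longer_than(durations, 3600)
profile = hourly_mean_cpu(cpu_samples)
print(p95, long_share, profile)
```

Folding samples from different days into the same hour-of-day bucket is one simple way to expose a daily cycle such as the one reported for container CPU utilization; a real analysis of a multi-day trace would likely also check each day separately.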

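Abstract (translated from the Chinese):

To raise the resource utilization of cloud data centers while lowering cost and energy consumption, many cloud data centers now deploy online services and offline tasks together. Although this co-located deployment brings many benefits to the data center, it increases the complexity of task scheduling and poses a series of challenges for guaranteeing high reliability and low latency of services. This paper analyzes in depth the operation of all online services and offline tasks over 8 days in one cluster of 4 034 servers in an Alibaba data center. The data analysis leads to the following conclusions. First, regarding online services, the average CPU utilization of all containers varies periodically, staying at a relatively high level from 8 am to 9 pm each day and falling to its lowest point at 4 am each day. Second, for offline tasks, excluding the first and eighth days, the task-submission peaks on the remaining 6 days are concentrated at the same time each day. The running time of 95% of the instances is within 199 s, but 0.052% of the instances run for more than 1 h or even for several days. Third, regarding applications, the number of containers deployed differs considerably between applications: one application uses at most 629 containers and at least 1 container. Finally, cluster analysis is performed on servers, online tasks and batch instances: containers with relatively high resource utilization make up the vast majority of all containers, while instances with low resource utilization and short execution time make up the vast majority of all instances. The findings and recommendations presented help data center managers understand workload characteristics in more detail, thereby improving the resource utilization of the data center and the fault tolerance of each task.

Key words (translated from the Chinese): data center, workload characteristics, online service, offline task, scheduling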

CLC Number: