NAOC Open IR
DL2: A Deep Learning-Driven Scheduler for Deep Learning Clusters
Peng, Yanghua [1]; Bao, Yixin [1]; Chen, Yangrui [1]; Wu, Chuan [1]; Meng, Chen [2]; Lin, Wei [3]
2021-08-01
Source Publication: IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
ISSN: 1045-9219
Volume: 32, Issue: 8, Pages: 1947-1960
Abstract: Efficient resource scheduling is essential for maximal utilization of expensive deep learning (DL) clusters. Existing cluster schedulers are either agnostic to machine learning (ML) workload characteristics or use scheduling heuristics based on operators' understanding of particular ML frameworks and workloads, which are less efficient or not general enough. In this article, we show that DL techniques can be adopted to design a generic and efficient scheduler. Specifically, we propose DL2, a DL-driven scheduler for DL clusters, targeting global training job expedition by dynamically resizing resources allocated to jobs. DL2 advocates a joint supervised learning and reinforcement learning approach: a neural network is warmed up via offline supervised learning based on job traces produced by the existing cluster scheduler; then the neural network is plugged into the live DL cluster, fine-tuned by reinforcement learning carried out throughout the training progress of the DL jobs, and used for deciding job resource allocation in an online fashion. We implement DL2 on Kubernetes and enable dynamic resource scaling in DL jobs on MXNet. Extensive evaluation shows that DL2 outperforms a fairness scheduler (i.e., DRF) by 44.1 percent and an expert heuristic scheduler (i.e., Optimus) by 17.5 percent in terms of average job completion time.
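The two-phase idea in the abstract (offline supervised warm-up on traces from an existing scheduler, then online reinforcement learning) can be illustrated with a toy sketch. This is not the DL2 implementation: a small linear softmax policy stands in for the neural network, plain REINFORCE with a running-average baseline is assumed as the reinforcement learning method, and every name here (`SoftmaxPolicy`, `supervised_warmup`, `reinforce_finetune`, the two-context toy cluster and its reward) is hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SoftmaxPolicy:
    """Linear softmax policy (a stand-in for a neural network)."""

    def __init__(self, n_features, n_actions):
        self.w = [[0.0] * n_features for _ in range(n_actions)]

    def probs(self, state):
        logits = [sum(wi * si for wi, si in zip(row, state)) for row in self.w]
        return softmax(logits)

    def step(self, state, action, scale, lr):
        """Gradient ascent on scale * log pi(action | state)."""
        p = self.probs(state)
        for a, row in enumerate(self.w):
            coeff = scale * ((1.0 if a == action else 0.0) - p[a])
            for i, s in enumerate(state):
                row[i] += lr * coeff * s


def supervised_warmup(policy, trace, epochs=10, lr=0.2):
    """Phase 1: imitate (state, action) pairs logged from an existing scheduler."""
    for _ in range(epochs):
        for state, action in trace:
            policy.step(state, action, scale=1.0, lr=lr)  # cross-entropy gradient


def reinforce_finetune(policy, sample_state, reward_fn, steps=5000, lr=0.2, seed=0):
    """Phase 2: REINFORCE with a running-average reward baseline."""
    rng = random.Random(seed)
    baseline = 0.0
    for _ in range(steps):
        state = sample_state(rng)
        p = policy.probs(state)
        action = rng.choices(range(len(p)), weights=p)[0]
        reward = reward_fn(state, action)
        policy.step(state, action, scale=reward - baseline, lr=lr)
        baseline += 0.05 * (reward - baseline)


# Toy setup (hypothetical): two one-hot cluster contexts, two allocation actions.
CONTEXTS = [[1.0, 0.0], [0.0, 1.0]]
# The logged heuristic always picks action 0, which is wrong in the second context.
trace = [(CONTEXTS[0], 0), (CONTEXTS[1], 0)]

policy = SoftmaxPolicy(n_features=2, n_actions=2)
supervised_warmup(policy, trace)
after_warmup = policy.probs(CONTEXTS[1])  # mimics the heuristic

# Toy reward: 1 when the action index matches the active context, else 0.
reinforce_finetune(
    policy,
    sample_state=lambda rng: rng.choice(CONTEXTS),
    reward_fn=lambda s, a: 1.0 if s[a] == 1.0 else 0.0,
)
after_rl = policy.probs(CONTEXTS[1])  # should now favor action 1
```

In this toy setting the warm-up gives the policy a reasonable starting point that copies the heuristic, and the fine-tuning phase then corrects the heuristic's mistake in the second context using reward feedback alone. A real scheduler would use a far richer state, action space, and reward than the ones assumed here.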
Keyword: Deep learning; resource allocation; distributed training
Funding Organization: Alibaba Group through Alibaba Innovative Research (AIR) Program ; Hong Kong RGC
DOI: 10.1109/TPDS.2021.3052895
Language: English
Funding Project: Alibaba Group through Alibaba Innovative Research (AIR) Program ; Hong Kong RGC[HKU 17204619] ; Hong Kong RGC[17208920]
WOS Research Area: Computer Science ; Engineering
WOS Subject: Computer Science, Theory & Methods ; Engineering, Electrical & Electronic
WOS ID: WOS:000622094200004
Publisher: IEEE COMPUTER SOC
Document Type: Journal article
Identifier: http://ir.bao.ac.cn/handle/114a11/78898
Collection: National Astronomical Observatories, Chinese Academy of Sciences (中国科学院国家天文台)
Corresponding Author: Peng, Yanghua
Affiliation: 1. Univ Hong Kong, Hong Kong, Peoples R China
2. NAOC, Beijing 100012, Peoples R China
3. Alibaba Inc, Hangzhou 311121, Zhejiang, Peoples R China
Recommended Citation
GB/T 7714: Peng, Yanghua, Bao, Yixin, Chen, Yangrui, et al. DL2: A Deep Learning-Driven Scheduler for Deep Learning Clusters[J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, 32(8): 1947-1960.
APA: Peng, Yanghua, Bao, Yixin, Chen, Yangrui, Wu, Chuan, Meng, Chen, & Lin, Wei. (2021). DL2: A Deep Learning-Driven Scheduler for Deep Learning Clusters. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 32(8), 1947-1960.
MLA: Peng, Yanghua, et al. "DL2: A Deep Learning-Driven Scheduler for Deep Learning Clusters". IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 32.8 (2021): 1947-1960.
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.