Design and Operation of Shared Machine Learning Clusters on Campus

📅 2021-10-04
🏛️ International Conference on Architectural Support for Programming Languages and Operating Systems
📈 Citations: 35
Influential: 0
🤖 AI Summary
To address core challenges in university GPU cluster management—including resource sharing difficulties, high operational overhead, and unfair scheduling—this paper proposes SING, an end-to-end system. Methodologically, SING introduces: (1) a four-layer scalable architecture balancing deployment simplicity and maintainability; (2) a novel lightweight shared-operations paradigm tailored to campus environments, drastically reducing manual intervention; and (3) an integrated design combining containerized orchestration, fine-grained GPU isolation, priority- and quota-driven dynamic scheduling, and unified identity authentication with audit logging. Empirical evaluation demonstrates a 3.2× improvement in GPU resource utilization, a 76% reduction in operational response time, and stable support for over one hundred faculty and students conducting large-model training. All core components are open-sourced.
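To give a flavor of the priority- and quota-driven dynamic scheduling the summary mentions, here is a minimal illustrative sketch (not SING's actual implementation; all names such as `QuotaScheduler` are hypothetical). Jobs are drained in priority order, and a job is admitted only if its user still has GPU quota and enough GPUs are free:

```python
import heapq
from dataclasses import dataclass

@dataclass
class Job:
    user: str
    priority: int  # higher value = more urgent
    gpus: int      # number of GPUs requested

class QuotaScheduler:
    """Toy priority- and quota-driven scheduler (illustrative only)."""

    def __init__(self, quotas):
        self.quotas = dict(quotas)  # user -> remaining GPU quota
        self.heap = []              # max-heap via negated priority
        self.counter = 0            # tie-breaker for stable FIFO order

    def submit(self, job):
        heapq.heappush(self.heap, (-job.priority, self.counter, job))
        self.counter += 1

    def schedule(self, free_gpus):
        """Admit jobs in priority order, skipping any job that
        exceeds its user's quota or the free GPU count."""
        admitted, deferred = [], []
        while self.heap and free_gpus > 0:
            _, _, job = heapq.heappop(self.heap)
            if job.gpus <= free_gpus and self.quotas.get(job.user, 0) >= job.gpus:
                self.quotas[job.user] -= job.gpus
                free_gpus -= job.gpus
                admitted.append(job)
            else:
                deferred.append(job)
        for j in deferred:  # requeue jobs that could not run this round
            self.submit(j)
        return admitted
```

Usage: with quotas `{"alice": 4, "bob": 2}` and 4 free GPUs, a priority-9 job from bob is admitted before a priority-5 job from alice, while a job that would exceed remaining quota or capacity is deferred to the next scheduling round.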
📝 Abstract
The rapid advancement of large machine learning (ML) models has driven universities worldwide to invest heavily in GPU clusters. Effectively sharing these resources among multiple users is essential for maximizing both utilization and accessibility. However, managing shared GPU clusters presents significant challenges, ranging from system configuration to fair resource allocation among users. This paper introduces SING, a full-stack solution tailored to simplify shared GPU cluster management. Aimed at addressing the pressing need for efficient resource sharing with limited staffing, SING enhances operational efficiency by reducing maintenance costs and optimizing resource utilization. We provide a comprehensive overview of its four extensible architectural layers, explore the features of each layer, and share insights from real-world deployment, including usage patterns and incident management strategies. As part of our commitment to advancing shared ML cluster management, we open-source SING's resources to support the development and operation of similar systems.
Problem

Research questions and friction points this paper is trying to address.

Managing shared GPU clusters in universities efficiently
Addressing resource allocation challenges in ML clusters
Reducing maintenance costs while maximizing resource utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Full-stack solution for shared GPU clusters
Achieves low maintenance and high utilization
Open-sources resources for ML cluster management
👥 Authors
Kaiqiang Xu · PhD, AI and ML Systems · ML Systems, Cloud Computing, Computer Networks
Xinchen Wan · ByteDance · AI Networking, Datacenter Networking, Machine Learning System, Hardware Acceleration
Hao Wang · Hong Kong University of Science and Technology
Zhenghang Ren · Hong Kong University of Science and Technology
Xudong Liao · Hong Kong University of Science and Technology · Computer Networks, Machine Learning System
D. Sun · Hong Kong University of Science and Technology
Chaoliang Zeng
Kai Chen · Hong Kong University of Science and Technology