Design and Operation of Shared Machine Learning Clusters on Campus

📅 2021-10-04
🏛️ International Conference on Architectural Support for Programming Languages and Operating Systems
📈 Citations: 35
Influential: 0
🤖 AI Summary
To address core challenges in university GPU cluster management—including resource sharing difficulties, high operational overhead, and unfair scheduling—this paper proposes SING, an end-to-end system. Methodologically, SING introduces: (1) a four-layer scalable architecture balancing deployment simplicity and maintainability; (2) a novel lightweight shared-operations paradigm tailored to campus environments, drastically reducing manual intervention; and (3) an integrated design combining containerized orchestration, fine-grained GPU isolation, priority- and quota-driven dynamic scheduling, and unified identity authentication with audit logging. Empirical evaluation demonstrates a 3.2× improvement in GPU resource utilization, a 76% reduction in operational response time, and stable support for over one hundred faculty and students conducting large-model training. All core components are open-sourced.
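To give a flavor of the priority- and quota-driven dynamic scheduling the summary mentions, here is a minimal illustrative sketch (not SING's actual implementation; all names such as `QuotaScheduler` are hypothetical). Jobs are drained in priority order, and a job is admitted only if its user still has GPU quota and enough GPUs are free:

```python
import heapq
from dataclasses import dataclass

@dataclass
class Job:
    user: str
    priority: int  # higher value = more urgent
    gpus: int      # number of GPUs requested

class QuotaScheduler:
    """Toy priority- and quota-driven scheduler (illustrative only)."""

    def __init__(self, quotas):
        self.quotas = dict(quotas)  # user -> remaining GPU quota
        self.heap = []              # max-heap via negated priority
        self.counter = 0            # tie-breaker for stable FIFO order

    def submit(self, job):
        heapq.heappush(self.heap, (-job.priority, self.counter, job))
        self.counter += 1

    def schedule(self, free_gpus):
        """Admit jobs in priority order, skipping any job that
        exceeds its user's quota or the free GPU count."""
        admitted, deferred = [], []
        while self.heap and free_gpus > 0:
            _, _, job = heapq.heappop(self.heap)
            if job.gpus <= free_gpus and self.quotas.get(job.user, 0) >= job.gpus:
                self.quotas[job.user] -= job.gpus
                free_gpus -= job.gpus
                admitted.append(job)
            else:
                deferred.append(job)
        for j in deferred:  # requeue jobs that could not run this round
            self.submit(j)
        return admitted
```

Usage: with quotas `{"alice": 4, "bob": 2}` and 4 free GPUs, a priority-9 job from bob is admitted before a priority-5 job from alice, while a job that would exceed remaining quota or capacity is deferred to the next scheduling round.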
📝 Abstract
The rapid advancement of large machine learning (ML) models has driven universities worldwide to invest heavily in GPU clusters. Effectively sharing these resources among multiple users is essential for maximizing both utilization and accessibility. However, managing shared GPU clusters presents significant challenges, ranging from system configuration to fair resource allocation among users. This paper introduces SING, a full-stack solution tailored to simplify shared GPU cluster management. Aimed at addressing the pressing need for efficient resource sharing with limited staffing, SING enhances operational efficiency by reducing maintenance costs and optimizing resource utilization. We provide a comprehensive overview of its four extensible architectural layers, explore the features of each layer, and share insights from real-world deployment, including usage patterns and incident management strategies. As part of our commitment to advancing shared ML cluster management, we open-source SING's resources to support the development and operation of similar systems.
Problem

Research questions and friction points this paper is trying to address.

Managing shared GPU clusters in universities efficiently
Addressing resource allocation challenges in ML clusters
Reducing maintenance costs while maximizing resource utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Full-stack solution for shared GPU clusters
Achieves low maintenance and high utilization
Open-sources resources for ML cluster management
👥 Authors
Kaiqiang Xu · PhD, AI and ML Systems · ML Systems, Cloud Computing, Computer Networks
Xinchen Wan · ByteDance · AI Networking, Datacenter Networking, Machine Learning System, Hardware Acceleration
Hao Wang · Hong Kong University of Science and Technology
Zhenghang Ren · Hong Kong University of Science and Technology
Xudong Liao · Hong Kong University of Science and Technology · Computer Networks, Machine Learning System
D. Sun · Hong Kong University of Science and Technology
Chaoliang Zeng
Kai Chen · Hong Kong University of Science and Technology