🤖 AI Summary
To address core challenges in university GPU cluster management—including resource sharing difficulties, high operational overhead, and unfair scheduling—this paper proposes SING, an end-to-end system. Methodologically, SING introduces: (1) a four-layer scalable architecture balancing deployment simplicity and maintainability; (2) a novel lightweight shared-operations paradigm tailored to campus environments, drastically reducing manual intervention; and (3) an integrated design combining containerized orchestration, fine-grained GPU isolation, priority- and quota-driven dynamic scheduling, and unified identity authentication with audit logging. Empirical evaluation demonstrates a 3.2× improvement in GPU resource utilization, a 76% reduction in operational response time, and stable support for over one hundred faculty and students conducting large-model training. All core components are open-sourced.
📝 Abstract
The rapid advancement of large machine learning (ML) models has driven universities worldwide to invest heavily in GPU clusters. Effectively sharing these resources among multiple users is essential for maximizing both utilization and accessibility. However, managing shared GPU clusters presents significant challenges, ranging from system configuration to fair resource allocation among users. This paper introduces SING, a full-stack solution tailored to simplify shared GPU cluster management. Aimed at addressing the pressing need for efficient resource sharing with limited staffing, SING enhances operational efficiency by reducing maintenance costs and optimizing resource utilization. We provide a comprehensive overview of its four extensible architectural layers, explore the features of each layer, and share insights from real-world deployment, including usage patterns and incident management strategies. As part of our commitment to advancing shared ML cluster management, we open-source SING's resources to support the development and operation of similar systems.