GPUnion: Autonomous GPU Sharing on Campus

📅 2025-07-24
🤖 AI Summary
To address the uneven distribution and low utilization of GPU resources across campus laboratories, this paper designs and implements a decentralized GPU resource-sharing platform. The platform adopts a “provider-first” architecture so that laboratories retain full autonomy over the GPUs they own. It enables cross-laboratory resource collaboration under strong isolation and high security via non-root containerized scheduling, image attestation, custom storage integration, and elastic checkpoint-based migration. A key innovation is automated task migration upon provider disconnection, which achieves a 94% success rate and improves system robustness. Experimental evaluation demonstrates a 30% increase in average GPU utilization and a 40% rise in interactive sessions, thereby improving the accessibility and sustainability of AI research infrastructure.

📝 Abstract
A pronounced imbalance in GPU resources exists on campus: some laboratories own underutilized servers while others lack the compute needed for AI research. GPU sharing can alleviate this disparity, but existing platforms typically rely on centralized oversight and persistent allocation models, which conflict with the voluntary and autonomous nature of academic resource ownership. We present GPUnion, a campus-scale GPU sharing platform that enables voluntary participation while preserving full provider autonomy. GPUnion incorporates three core mechanisms: i) container-based task dispatching and execution, ii) a resource provider-first architecture, and iii) resilient execution featuring automatic checkpointing and migration. GPUnion also supports custom data storage and integrates non-root execution and image attestation to improve the isolation and security of containerized workloads. Case studies across multiple campus scenarios demonstrate a 30% improvement in GPU utilization, a 40% increase in interactive sessions, and 94% successful workload migration during provider departures. GPUnion demonstrates that provider autonomy and platform reliability can coexist, challenging conventional centralized paradigms and democratizing access to computational resources within campus networks.
Problem

Research questions and friction points this paper is trying to address.

Addresses GPU resource imbalance on campus for AI research
Enables autonomous GPU sharing without centralized oversight
Improves GPU utilization and workload migration reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Container-based task dispatching and execution
Resource provider-first architecture design
Automatic checkpointing and migration resilience
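
The three contributions listed above can be illustrated with a toy model. The sketch below is not GPUnion's implementation. It is a minimal, in-memory illustration that assumes periodic step-based checkpoints and a least-loaded placement rule. All names (`Task`, `Provider`, `Pool`, `dispatch`, `withdraw`) are hypothetical and do not come from the paper. It shows a provider leaving unilaterally while its task resumes from the last checkpoint on another provider.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    total_steps: int
    checkpoint_every: int = 10   # assumed checkpoint interval, in steps
    done_steps: int = 0          # progress held by the running process
    saved_steps: int = 0         # progress recorded in the last checkpoint

    def run(self, steps: int) -> None:
        """Advance the task, saving a checkpoint at every interval boundary."""
        for _ in range(steps):
            if self.done_steps >= self.total_steps:
                break
            self.done_steps += 1
            if self.done_steps % self.checkpoint_every == 0:
                self.saved_steps = self.done_steps

    def restore(self) -> None:
        """Resume from the last checkpoint; work done since then is lost."""
        self.done_steps = self.saved_steps


@dataclass
class Provider:
    name: str
    online: bool = True
    tasks: list = field(default_factory=list)


class Pool:
    """Hypothetical coordinator: it places tasks but never blocks a withdrawal."""

    def __init__(self, providers):
        self.providers = {p.name: p for p in providers}

    def dispatch(self, task):
        """Place the task on the online provider with the fewest tasks."""
        candidates = [p for p in self.providers.values() if p.online]
        if not candidates:
            return None
        target = min(candidates, key=lambda p: len(p.tasks))
        target.tasks.append(task)
        return target

    def withdraw(self, name):
        """Provider leaves at will; its tasks restart elsewhere from checkpoints."""
        provider = self.providers[name]
        provider.online = False
        orphaned, provider.tasks = provider.tasks, []
        placements = {}
        for task in orphaned:
            task.restore()
            target = self.dispatch(task)
            placements[task.name] = target.name if target else None
        return placements


pool = Pool([Provider("lab-a"), Provider("lab-b")])
job = Task("train", total_steps=100)
first = pool.dispatch(job)       # placed on lab-a (first of two idle providers)
job.run(25)                      # checkpoints are saved at steps 10 and 20
moved = pool.withdraw(first.name)
print(moved, job.done_steps)
```

In this toy run the job loses the five steps completed after its last checkpoint and resumes at step 20 on `lab-b`. A real system would also have to persist checkpoint data somewhere that outlives the departing provider, which this sketch leaves out.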
Yufang Li
Ph.D. Student, School of Environment, Tsinghua University
Desalination, Reverse osmosis technology, Biofouling, Biofilm
Yuanbo Zhang
Sun Yat-sen University, Guangzhou, China
Hanlong Liao
National University of Defense Technology, Changsha, China
Guoming Tang
The Hong Kong University of Science and Technology (Guangzhou)
Sustainable Computing/AI, Cloud/Edge Computing, AI4Sus
Deke Guo
Sun Yat-sen University, Guangzhou, China