ReclaimNet: Reclaim-Aware Network Protocols for Voluntary GPU Sharing on Campus

📅 2026-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses voluntary GPU sharing on university campuses, where resource owners can revoke their GPUs and existing migration mechanisms, which assume stationary failure hazards, homogeneous interconnects, and unbounded transfer windows, fall short. The authors propose ReclaimNet, a network-layer migration protocol suite that treats provider reclaim as a first-class contract rather than a failure case. It combines three mechanisms: reclaim-aware checkpoint scheduling, volatility-aware destination selection, and deadline-aware migration traffic control with edge enforcement and a submillisecond TC BPF kill-switch. In a two-month deployment on a 54-node heterogeneous campus testbed, the system reduces work loss by 66% relative to Slurm preempt-and-requeue and by 38% relative to pipeline-redundancy checkpointing, shortens downtime by 38%, and degrades background research traffic by less than 3%.
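To make the idea of reclaim-aware checkpoint scheduling concrete, here is a minimal sketch. It is not ReclaimNet's scheduler, which the listing does not specify. It uses Young's classic first-order approximation for the checkpoint interval, sqrt(2C/λ), and simply re-evaluates it with the current departure hazard and the job's current bandwidth share. The function names and the idea of deriving checkpoint cost from a bandwidth share are assumptions made for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, hazard_per_s: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C / lambda).

    C is the time to write one checkpoint, lambda the departure (failure) rate.
    The classic formula assumes a stationary hazard.
    """
    if checkpoint_cost_s <= 0 or hazard_per_s <= 0:
        raise ValueError("checkpoint cost and hazard must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s / hazard_per_s)


def reclaim_aware_interval(
    state_bytes: float, bandwidth_share_bps: float, hazard_per_s: float
) -> float:
    """Hypothetical adaptation: recompute the interval from current conditions.

    The checkpoint cost is estimated from the job's current share of the
    contended link, and the hazard is whatever the (assumed) estimator
    reports for the present time of day.
    """
    cost_s = state_bytes * 8.0 / bandwidth_share_bps
    return young_interval(cost_s, hazard_per_s)
```

For instance, with 10 GB of state and a 1.6 Gbit/s share, one checkpoint takes about 50 s. At a hazard of one reclaim per hour the interval comes out near 600 s, and it halves to about 300 s when the hazard quadruples.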
📝 Abstract
University campuses host abundant but fragmented GPU resources whose voluntary sharing is blocked by a mismatch between revocable, autonomous ownership and migration mechanisms that assume stationary failure hazards, homogeneous interconnects, and unbounded transfer windows. We present ReclaimNet, a network-layer migration protocol suite that treats provider reclaim as a first-class contract rather than a failure case, combining three mechanisms: (i) reclaim-aware checkpoint scheduling that jointly adapts to time-varying departure hazards and contended bandwidth across co-resident jobs; (ii) volatility-aware destination selection integrating topology, survival probability, and notice-window feasibility; and (iii) deadline-aware migration traffic control with edge enforcement and a submillisecond TC BPF kill-switch. A two-month deployment on a 54-node heterogeneous campus testbed reduces work loss by 66% over Slurm preempt-and-requeue and 38% over pipeline-redundancy checkpointing, with 38% shorter downtime and under 3% degradation of background research traffic. The prototype is open-sourced at the anonymous repository https://anonymous.4open.science/r/ICNP2026-ReclaimNet/.
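As an illustration of mechanism (ii), the sketch below shows one way volatility-aware destination selection could combine the three signals named in the abstract. It is an assumption-laden toy, not the paper's algorithm: the `Candidate` fields, the `hop_penalty` weight, and the scoring rule are all hypothetical. Notice-window feasibility is treated as a hard filter (the state must be transferable before the reclaim deadline), and the surviving candidates are ranked by survival probability discounted by topological distance.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    survival_prob: float       # assumed estimate that the node is not reclaimed
    path_bandwidth_bps: float  # assumed available bandwidth on the path to it
    hops: int                  # assumed topological distance


def transfer_time_s(state_bytes: float, bandwidth_bps: float) -> float:
    """Time to move the job state over a path, ignoring protocol overhead."""
    return state_bytes * 8.0 / bandwidth_bps


def select_destination(
    candidates: Sequence[Candidate],
    state_bytes: float,
    notice_window_s: float,
    hop_penalty: float = 0.05,  # hypothetical weight, not from the paper
) -> Optional[Candidate]:
    """Pick the feasible candidate with the best discounted survival score."""
    feasible = [
        c for c in candidates
        if transfer_time_s(state_bytes, c.path_bandwidth_bps) <= notice_window_s
    ]
    if not feasible:
        return None  # no node can receive the state within the notice window
    return max(feasible, key=lambda c: c.survival_prob / (1.0 + hop_penalty * c.hops))
```

Treating feasibility as a filter rather than a score term reflects the contract framing: a destination that cannot be reached before the deadline is useless no matter how stable it is.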
Problem

Research questions and friction points this paper is trying to address.

GPU sharing
resource reclaim
network migration
voluntary sharing
heterogeneous infrastructure
Innovation

Methods, ideas, or system contributions that make the work stand out.

reclaim-aware migration
voluntary GPU sharing
checkpoint scheduling
deadline-aware traffic control
heterogeneous campus testbed
🔎 Similar Papers
2024-10-09 · International Conference on Architectural Support for Programming Languages and Operating Systems · Citations: 0