FailSafe: High-performance Resilient Serving

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Tensor parallelism (TP) in LLM inference is highly vulnerable to GPU failures, which cause costly KV cache recomputation, compute and memory imbalance, and service disruption. To address this, we propose a fault-tolerant TP-aware inference architecture: (i) cyclic KV cache placement with proactive KVCache backup to ensure state consistency; (ii) hybrid attention mechanisms and fine-grained load-aware routing for dynamic rescheduling upon failure; and (iii) on-demand weight recovery integrated with fused TP/data-parallel execution to minimize redundancy. Evaluated on an 8×H100 cluster, our approach achieves 2× higher throughput than baseline methods, reduces fault recovery latency by two orders of magnitude, and sustains high-performance inference under up to three concurrent GPU failures. To the best of our knowledge, this is the first work to systematically co-optimize high availability and low-latency inference for TP-based LLMs.

📝 Abstract
Tensor parallelism (TP) enables large language models (LLMs) to scale inference efficiently across multiple GPUs, but its tight coupling makes systems fragile: a single GPU failure can halt execution, trigger costly KVCache recomputation, and introduce long-term compute and memory imbalance. We present FailSafe, a fault-tolerant TP serving system that sustains high performance under irregular GPU availability. FailSafe introduces three techniques to balance computation and memory across GPUs: (1) Cyclic KVCache Placement for uniform memory utilization, (2) Hybrid Attention combining tensor- and data-parallel attention to eliminate stragglers, and (3) Fine-Grained Load-Aware Routing to dynamically balance requests. It further employs proactive KVCache backup and on-demand weight recovery to avoid expensive recomputation and redundant data transfers. We implement these techniques in a lightweight serving engine compatible with existing LLM infrastructures. Evaluated on an 8xH100 DGX system with real-world fault traces and representative workloads, FailSafe achieves up to 2x higher throughput and two orders of magnitude lower recovery latency compared to standard fault handling approaches. Even with up to three GPU failures, FailSafe sustains high throughput and balanced utilization, demonstrating robust and efficient LLM serving under dynamic and unreliable hardware conditions.
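The Fine-Grained Load-Aware Routing idea in the abstract can be illustrated with a minimal sketch. This is not FailSafe's implementation; the class and the pending-token metric are hypothetical, and the sketch simply shows the core principle: always dispatch the next request to the replica with the least outstanding work.

```python
import heapq

# Hypothetical sketch of fine-grained load-aware routing: each incoming
# request goes to the replica with the fewest pending tokens, tracked in a
# min-heap. Names (LoadAwareRouter, pending token counts) are illustrative,
# not from the paper.
class LoadAwareRouter:
    def __init__(self, replicas):
        # Heap entries are (pending_tokens, replica_id); smallest load pops first.
        self.heap = [(0, r) for r in replicas]
        heapq.heapify(self.heap)

    def route(self, request_tokens):
        # Pick the least-loaded replica, then charge it the new request's work.
        pending, replica = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (pending + request_tokens, replica))
        return replica

router = LoadAwareRouter(["gpu0", "gpu1", "gpu2"])
assignments = [router.route(n) for n in (100, 10, 10, 50)]
# The large first request lands on gpu0; later requests avoid it until
# other replicas accumulate comparable load.
```

In a real system the load metric would be updated as requests complete, and under a GPU failure the failed replica would simply be dropped from the heap.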
Problem

Research questions and friction points this paper is trying to address.

Enabling resilient LLM serving under GPU failures without performance degradation
Eliminating costly KVCache recomputation and memory imbalance during failures
Maintaining high throughput and low latency with dynamic GPU availability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cyclic KVCache Placement for uniform memory usage
Hybrid Attention combining tensor and data parallelism
Fine-Grained Load-Aware Routing for dynamic balancing
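The Cyclic KVCache Placement idea above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the function name and block-index scheme are assumptions. The point is that assigning KV blocks round-robin over the currently healthy GPUs keeps per-GPU memory within one block of uniform, even after a failure shrinks the cycle.

```python
from collections import Counter

# Hypothetical sketch of cyclic KVCache placement: KV blocks are assigned
# round-robin over the healthy GPUs, so removing a failed GPU from the
# cycle preserves near-uniform memory utilization.
def cyclic_placement(num_blocks, healthy_gpus):
    """Map each KV block index to a GPU in cyclic (round-robin) order."""
    return {b: healthy_gpus[b % len(healthy_gpus)] for b in range(num_blocks)}

# All 8 GPUs healthy: blocks spread evenly.
full = cyclic_placement(16, list(range(8)))

# GPU 3 fails: re-derive the placement over the 7 survivors.
degraded = cyclic_placement(16, [g for g in range(8) if g != 3])

loads = Counter(degraded.values())
# No block maps to the failed GPU, and per-GPU load differs by at most one.
assert 3 not in loads
assert max(loads.values()) - min(loads.values()) <= 1
```

In the actual system, remapping would be paired with the proactive KVCache backup described in the abstract, so blocks previously held by the failed GPU are restored from backup rather than recomputed.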