AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address service interruptions, degraded availability, and unpredictable latency in large language model (LLM) inference under multi-GPU tensor parallelism (TP) caused by single-GPU failures, this paper proposes AnchorTP, a state-preserving elastic TP framework. AnchorTP decouples a state-management daemon from the inference runtime to persistently retain model parameters and KV caches on surviving GPUs. It supports heterogeneous GPU partitioning (i.e., unequal-width TP groups) and is compatible with Mixture-of-Experts (MoE) architectures. Furthermore, it introduces a Continuous Minimal Migration (CMM) algorithm and a bandwidth-aware reconfiguration planner to minimize the volume of data moved during recovery. Experiments under realistic failure scenarios demonstrate that AnchorTP reduces Time to First Success by up to 11× and Time to Peak by up to 59%, significantly improving service availability and real-time responsiveness.

📝 Abstract
Large Language Model (LLM) inference services demand exceptionally high availability and low latency, yet multi-GPU Tensor Parallelism (TP) makes them vulnerable to single-GPU failures. We present AnchorTP, a state-preserving elastic TP framework for fast recovery. It (i) enables Elastic Tensor Parallelism (ETP) with unequal-width partitioning over any number of GPUs and compatibility with Mixture-of-Experts (MoE), and (ii) preserves model parameters and KV caches in GPU memory via a daemon decoupled from the inference process. To minimize downtime, we propose a bandwidth-aware planner based on a Continuous Minimal Migration (CMM) algorithm that minimizes reload bytes under a byte-cost dominance assumption, and an execution scheduler that pipelines P2P transfers with reloads. These components jointly restore service quickly with minimal data movement and without changing service interfaces. In typical failure scenarios, AnchorTP reduces Time to First Success (TFS) by up to 11x and Time to Peak (TTP) by up to 59% versus restart-and-reload.
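The abstract names unequal-width partitioning but does not spell out how shard widths are chosen. As a rough sketch (the function name and capacity-weight scheme are assumptions, not the paper's API), attention heads could be divided among surviving GPUs in proportion to a per-GPU capacity weight:

```python
# Hypothetical sketch of unequal-width tensor-parallel partitioning:
# give each surviving GPU a shard of attention heads proportional to a
# capacity weight (e.g. free memory or link bandwidth).
def partition_heads(num_heads: int, weights: list[float]) -> list[int]:
    total = sum(weights)
    raw = [num_heads * w / total for w in weights]
    shards = [int(r) for r in raw]
    # Hand leftover heads to the GPUs with the largest fractional parts.
    leftover = num_heads - sum(shards)
    order = sorted(range(len(weights)),
                   key=lambda i: raw[i] - shards[i], reverse=True)
    for i in order[:leftover]:
        shards[i] += 1
    return shards

# 32 heads over 3 equal survivors: one GPU simply gets one head fewer.
print(partition_heads(32, [1.0, 1.0, 1.0]))  # [11, 11, 10]
```

Uneven widths like `[11, 11, 10]` are exactly what equal-width TP cannot express, which is why a failure in a 4-way group would otherwise force a full restart on a power-of-two layout.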
Problem

Research questions and friction points this paper is trying to address.

Resilient LLM inference with elastic tensor parallelism for GPU failures
Preserving model parameters and KV caches during inference recovery
Minimizing downtime through bandwidth-aware planning and pipelined execution
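The last point, pipelining peer-to-peer transfers with reloads, can be sketched minimally with two concurrent task streams (this is an illustration, not the paper's scheduler; all names are hypothetical):

```python
# Illustrative sketch: overlap peer-to-peer shard transfers with
# disk/host reloads so the two recovery paths run concurrently
# instead of back-to-back.
import threading

def pipelined_recovery(p2p_tasks, reload_tasks, do_p2p, do_reload):
    """Drain both task streams in parallel; recovery completes when both finish."""
    results = []  # list.append is atomic in CPython, so this is safe here

    def drain(tasks, fn):
        for task in tasks:
            results.append(fn(task))

    threads = [
        threading.Thread(target=drain, args=(p2p_tasks, do_p2p)),
        threading.Thread(target=drain, args=(reload_tasks, do_reload)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With both paths active, end-to-end recovery time approaches the slower of the two streams rather than their sum.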
Innovation

Methods, ideas, or system contributions that make the work stand out.

Elastic tensor parallelism with unequal-width partitioning
State preservation via a decoupled daemon for model parameters and KV caches
Bandwidth-aware planner with continuous minimal migration algorithm
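The abstract does not give the CMM algorithm itself; under its stated goal (minimize reload bytes by keeping whatever a survivor already holds), a toy interval-overlap calculation illustrates the accounting. Everything here, including the function name and layout encoding, is an assumption for illustration:

```python
# Toy byte accounting for minimal migration: each weight matrix's columns
# form an interval [0, W); old/new map gpu_id -> (start, end) half-open
# column ranges. A survivor only fetches columns outside the overlap
# between its old and new shard.
def migration_bytes(old, new, bytes_per_col):
    moved = {}
    for gpu, (ns, ne) in new.items():
        os_, oe = old.get(gpu, (0, 0))
        overlap = max(0, min(ne, oe) - max(ns, os_))
        moved[gpu] = (ne - ns - overlap) * bytes_per_col
    return moved

# 4-way equal split of 1024 columns; GPU 3 fails, re-split over GPUs 0-2:
old = {0: (0, 256), 1: (256, 512), 2: (512, 768), 3: (768, 1024)}
new = {0: (0, 342), 1: (342, 683), 2: (683, 1024)}
print(migration_bytes(old, new, bytes_per_col=1))  # {0: 86, 1: 171, 2: 256}
```

In this example only 513 of 1024 columns move at all, which is the kind of saving the abstract's "minimal data movement" claim refers to, versus 1024 for a restart-and-reload baseline.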
Wendong Xu
Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong
Chujie Chen
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
He Xiao
Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong
Kuan Li
Hong Kong University of Science and Technology (HKUST)
Jing Xiong
Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong
Chen Zhang
Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong
Wenyong Zhou
The University of Hong Kong
Chaofan Tao
Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong
Yang Bai
Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong
Bei Yu
Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong
Ngai Wong
Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong