MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization

📅 2025-10-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Frequent hardware failures during distributed training of large language models (LLMs) severely degrade training stability and resource utilization, and existing fault-tolerance mechanisms incur substantial computational and memory overhead. Method: This paper proposes MeCeFO, an efficient fault-tolerant optimization framework integrating three techniques: (1) skip connections that bypass multi-head attention (MHA) during backpropagation for a memory- and computation-efficient approximation; (2) on-the-fly recomputation of feed-forward network (FFN) activations to reduce GPU memory footprint; and (3) low-rank gradient approximation to accelerate FFN weight updates. At the algorithmic level, it enables seamless task migration to a neighboring node upon failure while preserving the O(1/√(nT)) convergence rate of standard distributed SGD. Contribution/Results: Experiments demonstrate only a 4.18% throughput degradation under high failure rates, with 5.0–6.7× greater resilience than state-of-the-art methods, significantly improving training robustness and hardware utilization.
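Technique (1) above drops the MHA branch from backpropagation while keeping the residual (skip) path. A toy numpy sketch of that idea on a residual block y = x + f(x) is shown below; `mha_like` is a hypothetical stand-in nonlinearity, not real attention, and the exact/approximate backward functions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mha_like(x, W):
    # stand-in for an attention sub-layer (illustrative nonlinearity, not real MHA)
    return np.tanh(x @ W)

def exact_backward(x, W, grad_y):
    # exact input gradient of y = x + tanh(x @ W):
    # grad_x = grad_y + ((1 - tanh(xW)^2) * grad_y) @ W.T
    h = np.tanh(x @ W)
    return grad_y + ((1.0 - h**2) * grad_y) @ W.T

def skip_backward(grad_y):
    # MeCeFO-style approximation: keep only the residual (skip) path,
    # dropping the MHA branch from backpropagation entirely
    return grad_y

x = rng.standard_normal((4, 8))
W = 0.05 * rng.standard_normal((8, 8))   # small weights: skip path dominates
g = rng.standard_normal((4, 8))

exact = exact_backward(x, W, g)
approx = skip_backward(g)
# the skip path carries most of the gradient signal here
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
```

The approximation costs nothing in the backward pass and needs no stored MHA activations, which is the point: the neighboring node absorbing a failed node's work skips that branch entirely.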

📝 Abstract
As distributed optimization scales to meet the demands of Large Language Model (LLM) training, hardware failures become increasingly non-negligible. Existing fault-tolerant training methods often introduce significant computational or memory overhead, demanding additional resources. To address this challenge, we propose Memory- and Computation-efficient Fault-tolerant Optimization (MeCeFO), a novel algorithm that ensures robust training with minimal overhead. When a computing node fails, MeCeFO seamlessly transfers its training task to a neighboring node while employing memory- and computation-efficient algorithmic optimizations to minimize the extra workload imposed on the neighboring node handling both tasks. MeCeFO leverages three key algorithmic designs: (i) Skip-connection, which drops the multi-head attention (MHA) module during backpropagation for memory- and computation-efficient approximation; (ii) Recomputation, which reduces activation memory in feedforward networks (FFNs); and (iii) Low-rank gradient approximation, enabling efficient estimation of FFN weight matrix gradients. Theoretically, MeCeFO matches the convergence rate of conventional distributed training, with a rate of $\mathcal{O}(1/\sqrt{nT})$, where n is the data parallelism size and T is the number of iterations. Empirically, MeCeFO maintains robust performance under high failure rates, incurring only a 4.18% drop in throughput, demonstrating 5.0$\times$ to 6.7$\times$ greater resilience than previous SOTA approaches. Codes are available at https://github.com/pkumelon/MeCeFO.
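Design (iii) estimates FFN weight gradients at reduced rank. The numpy sketch below illustrates the idea with a truncated SVD of the exact gradient of Y = X @ W; this is a stand-in assumption for exposition (the paper's method would avoid forming the full gradient), and all function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def ffn_weight_grad(X, dY):
    # exact gradient of the loss w.r.t. W for the linear layer Y = X @ W
    return X.T @ dY

def low_rank_grad(X, dY, r):
    # rank-r approximation: keep only the top-r singular directions of the
    # exact gradient (illustrative; a real method would sketch directly)
    G = ffn_weight_grad(X, dY)
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

X = rng.standard_normal((32, 16))   # activations: batch x d_in
dY = rng.standard_normal((32, 16))  # upstream gradient: batch x d_out

G = ffn_weight_grad(X, dY)
G4 = low_rank_grad(X, dY, 4)
G8 = low_rank_grad(X, dY, 8)

# truncation error shrinks as the retained rank grows
err4 = np.linalg.norm(G - G4)
err8 = np.linalg.norm(G - G8)
```

A rank-r update can be stored and applied in O(r(d_in + d_out)) instead of O(d_in · d_out), which is what makes the weight update cheap on the overloaded neighboring node.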
Problem

Research questions and friction points this paper is trying to address.

Addressing hardware failures in distributed LLM training
Reducing computational and memory overhead in fault tolerance
Maintaining robust training performance under node failures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fault-tolerant optimization for robust LLM training
Memory-efficient skip-connection and recomputation techniques
Low-rank gradient approximation for computational efficiency
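The recomputation technique in the list above trades compute for memory: the forward pass stores only the FFN's input, and the backward pass rebuilds the hidden activation on the fly. A minimal numpy sketch for a two-layer ReLU FFN, assuming simplified shapes and hypothetical function names:

```python
import numpy as np

rng = np.random.default_rng(2)

# two-layer FFN: h = relu(x @ W1); y = h @ W2
def forward_no_cache(x, W1, W2):
    # store only the layer input x; the hidden activation h is discarded
    h = np.maximum(x @ W1, 0.0)
    return h @ W2

def backward_recompute(x, W1, W2, grad_y):
    # recompute h from the saved input instead of reading it from memory
    pre = x @ W1
    h = np.maximum(pre, 0.0)
    grad_W2 = h.T @ grad_y
    grad_h = grad_y @ W2.T
    grad_pre = grad_h * (pre > 0)   # ReLU gate
    grad_W1 = x.T @ grad_pre
    return grad_W1, grad_W2

x = rng.standard_normal((8, 4))
W1 = rng.standard_normal((4, 6))
W2 = rng.standard_normal((6, 3))
g = rng.standard_normal((8, 3))    # upstream gradient dL/dy

y = forward_no_cache(x, W1, W2)
gW1, gW2 = backward_recompute(x, W1, W2, g)
```

The extra forward matmul in `backward_recompute` is the cost paid to halve the activation footprint, the same trade activation checkpointing makes in standard training.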
Authors
Rizhen Hu (Peking University)
Yutong He (Peking University)
Ran Yan (University of California, Los Angeles)
Mou Sun (Zhejiang Lab)
Binhang Yuan (HKUST)
Kun Yuan (Peking University)