Exploring Silent Data Corruption as a Reliability Challenge in LLM Training

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language model pretraining to silent data corruption (SDC): hardware faults that can produce anomalous gradients, abrupt loss spikes, and even training divergence. The study presents the first systematic characterization of SDC sensitivity across bit positions, operators, and training phases, and introduces a lightweight detection and rollback-recovery mechanism. Faults are injected at the level of GPU matrix-multiplication instructions, while gradients, loss, and NaN propagation are monitored dynamically to identify harmful parameter updates and trigger a step-level rollback with recomputation. Evaluated on LLaMA models ranging from 60M to 1.3B parameters, the approach captures and mitigates SDC-induced training anomalies, significantly improving training stability.
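The detect-and-rollback loop described above can be illustrated with a minimal sketch. The helper names (`detect_anomaly`, `train_step`, `step_fn`) and the spike threshold are illustrative assumptions, not code or thresholds from the paper:

```python
import math

def detect_anomaly(loss: float, grads: list[float],
                   loss_history: list[float],
                   spike_factor: float = 3.0) -> bool:
    """Flag an update as potentially SDC-corrupted: a NaN/Inf in the
    loss or gradients, or a loss spike relative to the recent mean.
    (spike_factor is an illustrative threshold, not from the paper.)"""
    if not math.isfinite(loss) or any(not math.isfinite(g) for g in grads):
        return True
    if loss_history:
        mean = sum(loss_history) / len(loss_history)
        if loss > spike_factor * mean:
            return True
    return False

def train_step(params: list[float], batch, step_fn,
               loss_history: list[float]) -> float:
    """One guarded step: snapshot the parameters, run the step, and on
    detection roll back and recompute (step-level rollback-recovery)."""
    snapshot = list(params)          # cheap in-memory checkpoint
    loss, grads, new_params = step_fn(params, batch)
    if detect_anomaly(loss, grads, loss_history):
        params[:] = snapshot         # discard the corrupted update
        loss, grads, new_params = step_fn(params, batch)  # recompute
    params[:] = new_params
    loss_history.append(loss)
    return loss
```

Because intermittent SDC rarely recurs on the exact same instruction, recomputing the step usually yields a clean update, which is why a single-step rollback suffices in this scheme.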
📝 Abstract
As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults that bypass system-level detection mechanisms. SDC may behave like benign numerical noise, but can also cause harmful gradient corruption that leads to loss spikes, divergence, or stalled progress. This work provides a controlled study of how intermittent SDC affects LLM pretraining. Using targeted fault injection at the level of GPU matrix-multiply instructions, we characterize the sensitivity of different bit positions, kernel functions, and execution stages. Our analysis shows that locally originating faults can produce impactful corruption, including NaN propagation, short-lived spikes in loss, gradient norm, and attention logits, as well as persistent parameter divergence. Building on the observed corruption signatures, we propose a lightweight detection method that identifies potentially harmful parameter updates. Experiments on LLaMA models with 60M, 350M, and 1.3B parameters demonstrate that recomputing the most recent training step upon detection can effectively mitigate the impact of these events.
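The bit-position sensitivity the abstract describes can be made concrete with a small sketch of single-bit fault injection on an IEEE-754 float32 value. `flip_bit` is an illustrative helper, not the paper's injection harness, which operates at the GPU matrix-multiply instruction level:

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip a single bit (0 = mantissa LSB, 31 = sign) in the IEEE-754
    float32 encoding of x, emulating a single-bit silent corruption."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return y

# Bit position matters: a low mantissa flip is benign numerical noise,
# while a high exponent flip turns 1.0 into +inf, seeding the NaN
# propagation and loss spikes observed in training.
assert abs(flip_bit(1.0, 0) - 1.0) < 1e-6   # tiny perturbation
assert flip_bit(1.0, 30) == float("inf")    # catastrophic corruption
```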
Problem

Research questions and friction points this paper is trying to address.

Silent Data Corruption
Large Language Models
Training Reliability
Hardware Faults
Gradient Corruption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Silent Data Corruption
LLM Training Reliability
Fault Injection
Gradient Corruption
Lightweight Detection