🤖 AI Summary
This work reveals that large language models are highly vulnerable to hardware bit-flip attacks due to the spatial alignment between activation outliers and sensitive weight bits—a previously unexplored fragility mechanism. To address this, the authors propose a training-free geometric smoothing defense that applies an orthogonal Householder transformation to rotate the activation space, thereby disrupting the alignment between outliers and vulnerable weights. This approach provides theoretically grounded robustness with negligible storage and inference overhead. Empirically, it reduces the random bit-flip crash rate from 3.15% to 0.00% on Qwen2.5-7B and achieves a 43.9% MMLU accuracy on Llama-2-7B under strong targeted attacks—nearly matching the original model’s performance—while increasing the complexity of single-point fault attacks to over 17,000 precise bit flips.
📝 Abstract
Hardware faults, specifically bit-flips in quantized weights, pose a severe reliability threat to Large Language Models (LLMs), often triggering catastrophic model collapses. We demonstrate that this vulnerability fundamentally stems from the spatial alignment between sensitive weight bits and extreme activation outliers, which causes a single hardware fault to be massively amplified. To address this, we propose Rotated Robustness (RoR), a training-free defense utilizing orthogonal Householder transformations. By applying an orthogonal rotation to the activation space, RoR geometrically smooths extreme outliers across all feature dimensions. This mechanism effectively breaks the alignment between outliers and vulnerable weights, while mathematically guaranteeing that the original model's accuracy is preserved. Extensive empirical evaluations across the Llama-2/3, OPT, and Qwen families demonstrate the superior reliability of our approach. Under random bit-flip attacks, RoR reduces the stochastic collapse rate from 3.15% to 0.00% on Qwen2.5-7B. Furthermore, under severe targeted attacks with 50 Progressive Bit Search flips, RoR sustains robust reasoning on Llama-2-7B, maintaining a 43.9% MMLU accuracy that nearly matches its 45.2% unattacked accuracy, while competing defenses collapse to random guessing. Most notably, against the Single-Point Fault Attack (SPFA), the most aggressive targeted threat, RoR exponentially inflates the attack complexity from a few bits to over 17,000 precise bit-flips. With a negligible storage overhead of 0.31% and a minimal inference latency increase of 9.1% on Llama-2-7B, RoR achieves true lossless robustness, providing a practical and highly reliable defense for LLM deployment.
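The core mechanism described above, an orthogonal Householder rotation that spreads an extreme activation outlier across all feature dimensions while leaving the layer's output mathematically unchanged, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the choice of Householder vector (mapping the outlier channel onto the uniform direction) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Activation with one extreme outlier channel, as is typical of LLM hidden states.
x = rng.normal(0.0, 1.0, d)
x[0] = 100.0

# Householder reflection H = I - 2 w w^T chosen (for illustration) to map the
# outlier axis e_0 onto the uniform direction (1,...,1)/sqrt(d), so the
# outlier's energy is spread evenly across all channels.
e0 = np.zeros(d)
e0[0] = 1.0
u = np.ones(d) / np.sqrt(d)
w = e0 - u
w /= np.linalg.norm(w)
H = np.eye(d) - 2.0 * np.outer(w, w)   # symmetric and orthogonal: H @ H = I

W = rng.normal(0.0, 0.02, (d, d))      # a stand-in layer weight matrix

y_ref = W @ x                          # original layer output
y_rot = (W @ H.T) @ (H @ x)            # rotated weights applied to rotated activation

# Orthogonality makes the transform exactly lossless for the layer output...
assert np.allclose(y_ref, y_rot)
# ...while the activation outlier is geometrically smoothed, so no single
# weight bit multiplies an extreme value any longer.
print(f"max|x| = {np.abs(x).max():.1f}, max|Hx| = {np.abs(H @ x).max():.1f}")
```

Because `H` is orthogonal, folding `H.T` into the weights offline costs nothing in accuracy; only the stored Householder vector (hence the sub-1% storage overhead reported above) and one rotation per layer at inference are added.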