🤖 AI Summary
This work introduces a hardware-level jailbreaking attack that strips the alignment constraints from commercial-scale, human-aligned language models at runtime by flipping fewer than 25 targeted DRAM bits in the in-memory model parameters (as few as five in some cases), enabling harmful content generation without adversarial prompts. The method integrates Rowhammer-based fault injection, precise bit-flip modeling, parameter sensitivity analysis, and an efficient bit-localization algorithm. It is the first practical, low-overhead targeted attack against billion-parameter models, and it reveals that vulnerability to bit-level corruption varies across model components, a failure mode mechanistically distinct from prompt-based jailbreaking approaches. The attack is evaluated on 56 DRAM Rowhammer profiles from DDR4 and LPDDR4X devices and remains effective even on systems up to 46× more Rowhammer-secure than previously tested ones, while its bit-search algorithm is up to 20× more computationally efficient than prior methods.
📝 Abstract
We introduce a new class of attacks on commercial-scale (human-aligned) language models that induce jailbreaking through targeted bitwise corruptions in model parameters. Our adversary can jailbreak billion-parameter language models with fewer than 25 bit-flips in all cases, and as few as 5 in some, using up to 40$\times$ fewer bit-flips than existing attacks on computer vision models at least 100$\times$ smaller. Unlike prompt-based jailbreaks, our attack renders these models in memory 'uncensored' at runtime, allowing them to generate harmful responses without any input modifications. Our attack algorithm efficiently identifies target bits to flip, offering up to 20$\times$ greater computational efficiency than previous methods. This makes it practical for language models with billions of parameters. We show an end-to-end exploitation of our attack using software-induced fault injection, Rowhammer (RH). Our work examines 56 DRAM RH profiles from DDR4 and LPDDR4X devices with different RH vulnerabilities. We show that our attack can reliably induce jailbreaking in systems similar to those affected by prior bit-flip attacks. Moreover, our approach remains effective even against highly RH-secure systems (e.g., 46$\times$ more secure than previously tested systems). Our analyses further reveal that: (1) models with less post-training alignment require fewer bit flips to jailbreak; (2) certain model components, such as value projection layers, are substantially more vulnerable than others; and (3) our method is mechanistically different from existing jailbreaks. Our findings highlight a pressing, practical threat to the language model ecosystem and underscore the need for research to protect these models from bit-flip attacks.
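To give a sense of why a handful of bit flips can matter, the sketch below shows how much a single flipped bit can change one stored parameter. It is only an illustration of floating-point encoding, not the paper's attack or bit-search algorithm. It assumes weights are stored as IEEE 754 half-precision (float16) values, which the abstract does not state, and the helper name `flip_bit_fp16` is hypothetical.

```python
import struct


def flip_bit_fp16(value: float, bit: int) -> float:
    """Flip one bit in the float16 encoding of `value` and return the result.

    Assumed layout (IEEE 754 binary16): bits 0-9 are the mantissa,
    bits 10-14 the exponent, and bit 15 the sign.
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit index must be in the range 0..15")
    # Reinterpret the half-precision value as a 16-bit unsigned integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    # XOR with a one-hot mask flips exactly the chosen bit.
    flipped = raw ^ (1 << bit)
    # Reinterpret the modified bits as a half-precision value again.
    return struct.unpack("<e", struct.pack("<H", flipped))[0]


weight = 0.5
print(flip_bit_fp16(weight, 0))   # lowest mantissa bit: a tiny change
print(flip_bit_fp16(weight, 15))  # sign bit: the value is negated
print(flip_bit_fp16(weight, 14))  # top exponent bit: a change of several orders of magnitude
```

Under these assumptions, flipping a low mantissa bit barely moves the value, while flipping the most significant exponent bit turns 0.5 into 32768. This is one plausible reason the position of a flipped bit matters as much as the number of flips.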