🤖 AI Summary
This work investigates the vulnerability of language model safety alignment under low-precision quantization, specifically examining jailbreak success rates under fault-injection attacks. We propose a gradient-guided attack framework operating at both the bit level and the word (single-weight) level, integrating a progressive bit-flipping search with a single-weight update strategy. We systematically evaluate attack robustness across four quantization formats: FP16, FP8, INT8, and INT4. Results show FP8 significantly enhances resilience—jailbreak rates remain below 20% after 25 bit-flips and under 65% even after 150 flips—whereas FP16 is highly susceptible (>80% jailbreak rate). Crucially, we identify for the first time cross-format transferability of jailbreak behavior from FP16 to FP8/INT8, an effect that is markedly suppressed in INT4. Our study establishes a fundamental link between quantization format and model security, providing both theoretical insights and practical guidelines for the secure quantized deployment of large language models.
📝 Abstract
The safety alignment of Language Models (LMs) is a critical concern, yet their integrity can be compromised by direct parameter-manipulation attacks, such as those induced by fault injection. As LMs are increasingly deployed with low-precision quantization for efficiency, this paper investigates the efficacy of such attacks for jailbreaking aligned LMs across different quantization schemes. We propose gradient-guided attacks: a tailored progressive bit-level search algorithm and a comparative word-level (single-weight update) attack. Our evaluation of Llama-3.2-3B, Phi-4-mini, and Llama-3-8B across FP16 (baseline) and weight-only quantization (FP8, INT8, INT4) reveals that the quantization format strongly influences attack success. While attacks readily achieve a high Attack Success Rate (ASR, >80%) on FP16 models within a budget of 25 perturbations, FP8 and INT8 models exhibit ASRs below 20% and 50%, respectively. When the perturbation budget is increased to 150 bit-flips, FP8 models maintain an ASR below 65%, showing greater resilience than INT8 and INT4 models, whose ASRs remain high. Analysis of perturbation locations further reveals differing architectural targets across quantization schemes, with (FP16, INT4) and (INT8, FP8) showing similar characteristics. Moreover, jailbreaks induced in FP16 models transfer readily to subsequent FP8/INT8 quantization (<5% ASR difference), though INT4 quantization significantly reduces the transferred ASR (avg. 35% drop). These findings highlight that while common quantization schemes, particularly FP8, increase the difficulty of direct parameter-manipulation jailbreaks, vulnerabilities can persist, especially through post-attack quantization.
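The core idea behind a gradient-guided bit-flip attack can be illustrated with a minimal sketch. This is not the paper's actual algorithm: it is a hypothetical single-weight, first-order illustration, where the change in the alignment loss from flipping one bit of an FP16 weight is approximated by the Taylor term grad × (w′ − w), and the most loss-decreasing flip is chosen greedily. The function names (`flip_bit_fp16`, `best_flip`) are introduced here for illustration only.

```python
import math
import struct

def flip_bit_fp16(w: float, bit: int) -> float:
    """Flip a single bit (0..15) in the IEEE-754 half-precision encoding of w."""
    (bits,) = struct.unpack('<H', struct.pack('<e', w))   # float16 -> uint16
    bits ^= 1 << bit                                      # flip the chosen bit
    (w_new,) = struct.unpack('<e', struct.pack('<H', bits))  # uint16 -> float16
    return w_new

def best_flip(w: float, grad: float):
    """Greedy gradient-guided choice of one bit-flip for a single weight.

    Predicted loss change for flipping to w' is the first-order term
    grad * (w' - w); the flip with the most negative prediction is selected.
    Flips that produce NaN or Inf are skipped, as they would break the model.
    Returns (predicted_delta_loss, bit_index, new_weight_value).
    """
    candidates = []
    for bit in range(16):
        w_new = flip_bit_fp16(w, bit)
        if math.isnan(w_new) or math.isinf(w_new):
            continue
        candidates.append((grad * (w_new - w), bit, w_new))
    return min(candidates)  # most negative predicted loss change
```

For example, with `w = 1.0` and a positive gradient `grad = 1.0`, the best single flip is the sign bit (bit 15), turning the weight into −1.0 with a predicted loss change of −2.0; a full attack would rank such candidates across all weights and apply them progressively under a flip budget.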