Towards Understanding the Robustness of Sparse Autoencoders

📅 2026-04-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work examines whether sparse autoencoders (SAEs) make large language models more robust to optimization-based jailbreak attacks. The authors insert a pretrained SAE into the Transformer residual stream at inference time, without modifying model weights or blocking gradients. Parametric ablations show a monotonic dose-response relationship between L0 sparsity and attack success rate, and indicate that intermediate layers balance robustness against clean task performance; both findings are consistent with a representational bottleneck hypothesis. Across four model families (Gemma, LLaMA, Mistral, Qwen), SAE-augmented models reduce jailbreak success rates by up to 5x relative to the undefended baseline and reduce cross-model attack transferability.

📝 Abstract
Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers balance robustness and clean performance. These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.
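
The splice described in the abstract can be pictured with a toy forward pass. The sketch below is illustrative only and is not the authors' implementation: it assumes a top-k SAE (the abstract does not state the SAE architecture), and every name, shape, and weight in it is hypothetical. It replaces one residual-stream vector with its SAE reconstruction and reports the L0 of the latent code. In a real model the same encode/decode step would be a differentiable module in the forward pass, so attack gradients would still flow through it.

```python
# Illustrative sketch only: a tiny top-k sparse autoencoder applied to one
# residual-stream vector. All names, shapes, and weights are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_encode(x, w_enc, b_enc, k):
    """Encode x into latents, keeping only the k largest ReLU activations."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    acts = [max(0.0, a) for a in pre]
    # Indices of the k largest activations; all other latents are zeroed.
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def sae_decode(z, w_dec, b_dec):
    """Map sparse latents back to the residual-stream dimension."""
    return [a + b for a, b in zip(matvec(w_dec, z), b_dec)]

def l0(z):
    """Number of nonzero latents (the sparsity measure varied in the ablations)."""
    return sum(1 for a in z if a != 0.0)

def splice_sae(resid, w_enc, b_enc, w_dec, b_dec, k):
    """Replace a residual vector with its SAE reconstruction."""
    z = sae_encode(resid, w_enc, b_enc, k)
    return sae_decode(z, w_dec, b_dec), z

# Toy dimensions: d_model = 3, n_latents = 4 (made-up weights).
w_enc = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1, 0, 0, 0.5], [0, 1, 0, 0.5], [0, 0, 1, 0]]
b_dec = [0.0, 0.0, 0.0]

resid = [2.0, -1.0, 0.5]
recon, z = splice_sae(resid, w_enc, b_enc, w_dec, b_dec, k=2)
print("latents:", z, "L0:", l0(z))
print("reconstruction passed to later layers:", recon)
```

Lowering `k` in this toy setup lowers L0 and makes the reconstruction a coarser projection of the original vector, which is the kind of bottleneck the abstract's hypothesis refers to.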
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Jailbreak Attacks
Sparse Autoencoders
Robustness
Gradient-based Attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders
Jailbreak Robustness
Representational Bottleneck
L0 Sparsity
Gradient-based Attacks