🤖 AI Summary
This work proposes CFA², a jailbreaking framework for large language models (LLMs) that treats safety alignment mechanisms as unobserved confounders suppressing the model's inherent capabilities. It is the first to bring the front-door criterion from causal inference to LLM jailbreaking. The approach leverages sparse autoencoders (SAEs) to disentangle safety-related features and applies deterministic interventions that block the confounding pathway, enabling interpretable, mechanism-based, and robust attacks. By reducing the high-dimensional marginalization required by front-door adjustment to low-complexity interventions, CFA² achieves state-of-the-art attack success rates across multiple mainstream LLMs while uncovering the internal causal pathways through which safety mechanisms operate.
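For background, the classical front-door adjustment identity (standard causal-inference material, not a formula quoted from the paper) is what makes mediator-based identification possible; reading $X$ as the prompt's task intent, $M$ as the mediating representation, and $Y$ as the model response is an illustrative mapping rather than the paper's own notation:

$$
P(y \mid \mathrm{do}(x)) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(y \mid x', m)\, P(x')
$$

The double summation over $m$ and $x'$ is the high-dimensional marginalization referred to above, which CFA² is described as collapsing into a single deterministic intervention.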
📝 Abstract
Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. We then propose the Causal Front-Door Adjustment Attack (CFA²) to jailbreak LLMs, a framework that leverages Pearl's Front-Door Criterion to sever the confounding associations and enable robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features and isolate the core task intent. We further reduce the computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA² achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
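To make the feature-stripping step concrete, below is a minimal sketch of how an SAE could realize a deterministic intervention by zeroing latents flagged as safety-related. It assumes a plain PyTorch-style SAE; the class, function, and index names are hypothetical and not taken from the paper's implementation.

```python
# Hedged sketch: zero SAE latents flagged as "safety-related" before decoding
# back into the residual stream. All names here are illustrative, not the
# paper's codebase.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode activations into an overcomplete sparse basis."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(h))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)


def ablate_safety_features(h: torch.Tensor, sae: SparseAutoencoder,
                           safety_idx: torch.Tensor) -> torch.Tensor:
    """Deterministic intervention: zero the latents flagged as safety-related,
    then reconstruct the activation without them."""
    z = sae.encode(h)
    z[..., safety_idx] = 0.0  # sever the confounding pathway
    return sae.decode(z)


# Usage on a dummy hidden state (d_model=512, 4096 SAE latents, 3 flagged latents)
sae = SparseAutoencoder(d_model=512, d_latent=4096)
h = torch.randn(1, 512)
safety_idx = torch.tensor([17, 905, 2048])
h_clean = ablate_safety_features(h, sae, safety_idx)
```

In this reading, the reconstruction `h_clean` would replace the original hidden state at the hooked layer, so downstream computation no longer sees the suppressed safety-related directions.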