🤖 AI Summary
This work proposes CFA², a jailbreaking framework for large language models (LLMs) that treats safety alignment mechanisms as unobserved confounders suppressing the model's inherent capabilities. It is the first to bring the front-door criterion from causal inference to LLM jailbreaking. The approach leverages sparse autoencoders (SAEs) to disentangle safety-related features and applies deterministic interventions that block the confounding pathway, enabling interpretable, mechanism-based, and robust attacks. By reducing the high-dimensional marginalization required by front-door adjustment to low-complexity interventions, CFA² achieves state-of-the-art attack success rates across multiple mainstream LLMs while uncovering the internal causal pathways through which safety mechanisms operate.
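For background, the classical front-door adjustment identity (standard causal-inference material, not a formula quoted from the paper) is what makes mediator-based identification possible; reading $X$ as the prompt's task intent, $M$ as the mediating representation, and $Y$ as the model response is an illustrative mapping rather than the paper's own notation:

$$
P(y \mid \mathrm{do}(x)) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(y \mid x', m)\, P(x')
$$

The double summation over $m$ and $x'$ is the high-dimensional marginalization referred to above, which CFA² is described as collapsing into a single deterministic intervention.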
📝 Abstract
Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. We then propose the Causal Front-Door Adjustment Attack (CFA²) to jailbreak LLMs, a framework that leverages Pearl's Front-Door Criterion to sever the confounding associations and enable robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features and isolate the core task intent. We further reduce the computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA² achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
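To make the feature-stripping step concrete, below is a minimal sketch of how an SAE could realize a deterministic intervention by zeroing latents flagged as safety-related. It assumes a plain PyTorch-style SAE; the class, function, and index names are hypothetical and not taken from the paper's implementation.

```python
# Hedged sketch: zero SAE latents flagged as "safety-related" before decoding
# back into the residual stream. All names here are illustrative, not the
# paper's codebase.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode activations into an overcomplete sparse basis."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(h))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)


def ablate_safety_features(h: torch.Tensor, sae: SparseAutoencoder,
                           safety_idx: torch.Tensor) -> torch.Tensor:
    """Deterministic intervention: zero the latents flagged as safety-related,
    then reconstruct the activation without them."""
    z = sae.encode(h)
    z[..., safety_idx] = 0.0  # sever the confounding pathway
    return sae.decode(z)


# Usage on a dummy hidden state (d_model=512, 4096 SAE latents, 3 flagged latents)
sae = SparseAutoencoder(d_model=512, d_latent=4096)
h = torch.randn(1, 512)
safety_idx = torch.tensor([17, 905, 2048])
h_clean = ablate_safety_features(h, sae, safety_idx)
```

In this reading, the reconstruction `h_clean` would replace the original hidden state at the hooked layer, so downstream computation no longer sees the suppressed safety-related directions.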