Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

📅 2026-02-05
🤖 AI Summary
This work proposes CFA², a jailbreaking framework for large language models (LLMs) that treats safety alignment mechanisms as unobserved confounders suppressing the model's inherent capabilities. It introduces the front-door criterion from causal inference to LLM jailbreaking for the first time. The approach leverages sparse autoencoders (SAEs) to disentangle safety-related features and applies deterministic interventions that block the confounding pathways, enabling interpretable, mechanism-based, and robust attacks. By reducing the high-dimensional marginalization problem to low-complexity interventions, CFA² achieves state-of-the-art attack success rates across multiple mainstream LLMs while uncovering the internal causal pathways through which safety mechanisms operate.
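For context, Pearl's front-door adjustment identifies the causal effect of a prompt $X$ on a response $Y$ through a mediator $M$ (in this paper's framing, the isolated task intent), even when an unobserved confounder (the safety mechanism) affects both $X$ and $Y$. The standard identification formula is:

```latex
P(y \mid \mathrm{do}(x)) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

The summary's key claim is that the double marginalization over $m$ and $x'$ is intractable at LLM scale, which is why CFA² replaces it with a single deterministic intervention.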

📝 Abstract
Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. We then propose the Causal Front-Door Adjustment Attack (CFA²) to jailbreak LLMs, a framework that leverages Pearl's Front-Door Criterion to sever the confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce the computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA² achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
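The SAE step described in the abstract, encoding an activation into sparse features, zeroing safety-related latents, and decoding back, can be sketched as below. This is a minimal illustration, not the paper's implementation: the function name `sae_ablate`, the ReLU-SAE form, and the choice of `safety_idx` are all assumptions for the sake of the example.

```python
import numpy as np

def sae_ablate(activation, W_enc, b_enc, W_dec, b_dec, safety_idx):
    """Encode an activation into sparse features, zero the (hypothetical)
    safety-related latents, and decode back to the residual-stream space."""
    z = np.maximum(W_enc @ activation + b_enc, 0.0)  # ReLU sparse code
    z[safety_idx] = 0.0                              # ablate safety-related features
    return W_dec @ z + b_dec

# Toy example with random SAE weights (d = model dim, k = dictionary size).
rng = np.random.default_rng(0)
d, k = 8, 32
W_enc = rng.standard_normal((k, d))
b_enc = np.zeros(k)
W_dec = rng.standard_normal((d, k))
b_dec = np.zeros(d)

h = rng.standard_normal(d)
h_clean = sae_ablate(h, W_enc, b_enc, W_dec, b_dec, safety_idx=[3, 7])
print(h_clean.shape)  # same shape as the original activation: (8,)
```

In an actual attack, `W_enc`/`W_dec` would come from an SAE trained on the target LLM's activations, and `safety_idx` would be the indices of features identified as defense-related; the ablated activation then replaces the original one during the forward pass.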
Problem

Research questions and friction points this paper is trying to address.

LLM safety
jailbreak attacks
latent confounder
causal inference
alignment mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Inference
Front-Door Adjustment
Jailbreak Attack
Sparse Autoencoders
LLM Safety
Yao Zhou
Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Zeen Song
Institute of Software, Chinese Academy of Sciences
Machine Learning
Wenwen Qiang
Institute of Software, Chinese Academy of Sciences
Artificial Intelligence, Machine Learning, Causal Inference, LLM/MLLM
Fengge Wu
Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Shuyi Zhou
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Changwen Zheng
Institute of Software, Chinese Academy of Sciences
Machine Learning, Computer Simulation
Hui Xiong
Senior Scientist, Candela Corporation
Ultrafast dynamics, atomic and molecular physics, free electron laser