From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

📅 2026-03-03
🤖 AI Summary
This work addresses the vulnerability of large language models to jailbreaking under adversarial prefix attacks, which the authors attribute to semantic representation decay caused by shallow safety alignment. To mitigate this, they propose a two-stage causal GRPO framework (TSC-GRPO) that brings causal identifiability theory into alignment training. A causal intent probe first disentangles invariant intent from stylistic perturbations. Group Relative Policy Optimization with a cumulative causal penalty, applied in “fork-in-the-road” training scenarios, then internalizes this causal awareness in the policy: the model learns that accumulating harmful tokens monotonically reduces reward, which enables robust late-stage refusal. The resulting “intent pinning” mechanism alleviates representation decay and, according to the reported experiments, significantly improves jailbreak resistance while preserving general capabilities.

📝 Abstract
Large Language Models remain vulnerable to adversarial prefix attacks (e.g., “Sure, here is”) despite otherwise robust standard safety behavior. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within “fork-in-the-road” training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.
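
The cumulative-penalty idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the per-token harm scores (standing in for the causal intent probe's output), the names `cumulative_penalty_rewards` and `group_relative_advantages`, and the values of `base_reward` and `penalty_weight` are all assumptions made for illustration. The sketch only shows why a running sum of non-negative penalties makes reward non-increasing in the number of harmful tokens, and how GRPO-style group normalization then favors a continuation that refuses late over one that keeps complying.

```python
from statistics import mean, pstdev


def cumulative_penalty_rewards(harm_scores, base_reward=1.0, penalty_weight=0.5):
    """Running reward after each token under a cumulative penalty.

    harm_scores: hypothetical non-negative per-token harm scores
    (an assumed stand-in for a probe's output, not the paper's probe).
    """
    if any(score < 0 for score in harm_scores):
        raise ValueError("harm scores must be non-negative")
    rewards = []
    accumulated_harm = 0.0
    for score in harm_scores:
        accumulated_harm += score  # penalty only ever grows
        rewards.append(base_reward - penalty_weight * accumulated_harm)
    return rewards


def group_relative_advantages(final_rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within the sampled group."""
    group_mean = mean(final_rewards)
    group_std = pstdev(final_rewards)
    return [(r - group_mean) / (group_std + eps) for r in final_rewards]


# Two hypothetical continuations sharing the same harmful prefix (first two
# tokens): one refuses from the third token on, the other keeps complying.
late_refusal = cumulative_penalty_rewards([0.9, 0.8, 0.0, 0.0])
keeps_complying = cumulative_penalty_rewards([0.9, 0.8, 0.9, 0.7])

advantages = group_relative_advantages([late_refusal[-1], keeps_complying[-1]])
print("running rewards (late refusal):   ", late_refusal)
print("running rewards (keeps complying):", keeps_complying)
print("group-relative advantages:        ", advantages)
```

In this toy setup the late refusal receives the higher group-relative advantage even though both continuations share a harmful prefix, which is the behavior the “fork-in-the-road” scenarios are described as rewarding.
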
Problem

Research questions and friction points this paper is trying to address.

adversarial prefix attacks
Shallow Safety Alignment
semantic representation decay
jailbreak attacks
malicious intent
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal GRPO
Intent Pinning
Semantic Representation Decay
Adversarial Prefix Attacks
Two-Stage Causal-GRPO
Authors
Shuyi Zhou
University of Chinese Academy of Sciences, Beijing, China
Zeen Song
Institute of Software, Chinese Academy of Sciences
Machine Learning
Wenwen Qiang
Institute of Software, Chinese Academy of Sciences
Artificial Intelligence, Machine Learning, Causal Inference, LLM/MLLM
Jiyan Sun
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Yao Zhou
University of Chinese Academy of Sciences, Beijing, China
Yinlong Liu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Wei Ma
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China