Towards Context-Invariant Safety Alignment for Large Language Models

📅 2026-05-20
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses inconsistent safety behavior in large language models: semantically equivalent harmful requests that differ only in wording can elicit different responses, which leaves models vulnerable to adversarial phrasing. To mitigate this, the authors propose a context-invariant safety alignment method based on Anchor Invariance Regularization (AIR). AIR treats verifiable prompts as anchors and applies unidirectional regularization only to the open-ended generative variants, so that model responses track the user's underlying intent rather than superficial wording. By combining heterogeneous prompt grouping, group-based preference optimization (e.g., GRPO), and an auxiliary loss with a stop-gradient target, the approach improves in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49% across safety, moral reasoning, and mathematical tasks, making safety behavior more robust to adversarial formulations.
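The grouping idea in the summary can be illustrated with a small sketch. It is not the authors' code: the record fields (`intent_id`, `variant`, `correct`) are hypothetical, and the definition of group accuracy used here (a group counts only if every variant of the same intent is handled correctly) is an assumption, since the summary does not define the metric.

```python
from collections import defaultdict


def group_by_intent(records):
    """Collect prompt variants that share one underlying intent.

    Each record is a dict with hypothetical keys:
    'intent_id', 'variant', and 'correct'.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["intent_id"]].append(record)
    return dict(groups)


def group_accuracy(groups):
    """Fraction of groups in which every variant was handled correctly.

    This all-or-nothing definition is an assumption made for illustration.
    """
    if not groups:
        return 0.0
    hits = sum(all(r["correct"] for r in members) for members in groups.values())
    return hits / len(groups)


# Two intents, each with a verifiable and an open-ended variant.
records = [
    {"intent_id": "q1", "variant": "multiple_choice", "correct": True},
    {"intent_id": "q1", "variant": "open_ended", "correct": True},
    {"intent_id": "q2", "variant": "multiple_choice", "correct": True},
    {"intent_id": "q2", "variant": "open_ended", "correct": False},
]
groups = group_by_intent(records)
print(group_accuracy(groups))  # 0.5: only q1 is consistent across variants
```

Per-variant accuracy in this toy data is 75%, while the group-level figure is 50%, which is why a group-level metric is a stricter test of context invariance.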
πŸ“ Abstract
Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.
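The contrast between a symmetric invariance regularizer and an anchored one can be made concrete with a toy calculation. The sketch below is an illustration under stated assumptions, not the paper's loss: it assumes a squared gap between two scalar performance scores, and it models the stop-gradient by setting the anchor's gradient to zero by hand. All names (`GapLoss`, `symmetric_gap`, `anchored_gap`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GapLoss:
    value: float        # penalty on the cross-context discrepancy
    grad_anchor: float  # d(loss)/d(anchor_score)
    grad_open: float    # d(loss)/d(open_score)


def symmetric_gap(anchor_score: float, open_score: float) -> GapLoss:
    """Squared gap with gradients flowing to both variants."""
    gap = anchor_score - open_score
    return GapLoss(gap ** 2, 2.0 * gap, -2.0 * gap)


def anchored_gap(anchor_score: float, open_score: float) -> GapLoss:
    """Same penalty, but the anchor is a stop-gradient target.

    Assumption: the anchor is treated as a constant, so only the
    open-ended variant receives a gradient.
    """
    gap = anchor_score - open_score
    return GapLoss(gap ** 2, 0.0, -2.0 * gap)


# Verifiable anchor scores 0.9, open-ended variant scores 0.5.
sym = symmetric_gap(0.9, 0.5)
air = anchored_gap(0.9, 0.5)

# Under gradient descent a positive grad_anchor lowers the anchor score,
# which closes the gap by degrading the reliable variant.
print(sym.grad_anchor > 0)   # True
print(air.grad_anchor)       # 0.0: the anchor is left untouched
print(air.grad_open < 0)     # True: descent raises the open-ended score
```

In a real training setup the same effect would typically come from the framework's stop-gradient or detach operation applied to the anchor term, with the auxiliary penalty added to the group-based preference objective under some weighting coefficient; the abstract does not specify those details.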
Problem

Research questions and friction points this paper is trying to address.

context-invariant
safety alignment
large language models
adversarial prompts
preference-based alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

context-invariant alignment
Anchor Invariance Regularization
safety alignment
preference optimization
adversarial robustness
🔎 Similar Papers
2024-06-20 · arXiv.org · Citations: 26
Yixu Wang
Fudan University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China
Yang Yao
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Xin Wang
Fudan University
Computer Vision, Trustworthy ML
Yifeng Gao
Fudan University, Shanghai, China
Yan Teng
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Xingjun Ma
Fudan University
Trustworthy AI, Multimodal AI, Generative AI, Embodied AI
Yingchun Wang
Shanghai Artificial Intelligence Laboratory, Shanghai, China