Towards Context-Invariant Safety Alignment for Large Language Models

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inconsistent safety behavior of large language models on harmful requests that are semantically equivalent but phrased differently, an inconsistency that leaves them vulnerable to adversarial wording. The authors propose a context-invariant safety alignment method built on Anchor Invariance Regularization (AIR). AIR treats prompts with verifiable feedback (e.g., multiple-choice) as anchors and applies a one-directional regularizer only to the open-ended variants, pulling them toward the anchor performance through a stop-gradient target so that responses track the user's underlying intent rather than surface wording. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across safety, moral reasoning, and math tasks, the approach improves in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints more robust to adversarial framings.

📝 Abstract

Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.
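The contrast between a symmetric invariance penalty and an anchored one can be illustrated with a small sketch. This is not the paper's implementation: the squared-gap penalty, the hinge on the gap, and the names `GroupScores`, `symmetric_penalty_grads`, and `anchored_penalty_grads` are assumptions chosen for illustration, and the per-variant scores are treated as directly adjustable numbers rather than outputs of a policy. The stop-gradient is modeled by treating the anchor score as a constant when differentiating.

```python
from dataclasses import dataclass


@dataclass
class GroupScores:
    """Mean scores for one prompt group (hypothetical container)."""
    anchor: float      # verifiable variant, e.g. multiple-choice accuracy
    open_ended: float  # open-ended variant, scored by a noisy proxy


def symmetric_penalty_grads(g: GroupScores, weight: float = 1.0) -> dict:
    """Gradients of weight * (anchor - open_ended)**2 w.r.t. both scores.

    Both variants receive gradient, so gradient descent can close the gap
    by lowering the anchor score instead of raising the open-ended one.
    """
    diff = g.anchor - g.open_ended
    return {"anchor": 2 * weight * diff, "open_ended": -2 * weight * diff}


def anchored_penalty_grads(g: GroupScores, weight: float = 1.0) -> dict:
    """Gradients of weight * max(0, sg(anchor) - open_ended)**2.

    sg(.) is a stop-gradient: the anchor is a fixed target, so it gets
    zero gradient. The hinge (an assumption here) makes the penalty
    one-sided: it is active only when the open-ended score falls short.
    """
    gap = max(0.0, g.anchor - g.open_ended)
    return {"anchor": 0.0, "open_ended": -2 * weight * gap}


def descent_step(g: GroupScores, grads: dict, lr: float = 0.1) -> GroupScores:
    """One plain gradient-descent step on the two scores."""
    return GroupScores(
        anchor=g.anchor - lr * grads["anchor"],
        open_ended=g.open_ended - lr * grads["open_ended"],
    )


if __name__ == "__main__":
    group = GroupScores(anchor=0.9, open_ended=0.5)
    print(descent_step(group, symmetric_penalty_grads(group)))
    print(descent_step(group, anchored_penalty_grads(group)))
```

Under these assumptions, one descent step with the symmetric penalty moves the anchor score down as well as the open-ended score up, which is the failure mode the abstract describes. The anchored variant leaves the anchor untouched and moves only the open-ended score toward it.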

Problem

Research questions and friction points this paper is trying to address.

context-invariant

safety alignment

large language models

adversarial prompts

preference-based alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

context-invariant alignment

Anchor Invariance Regularization

safety alignment