When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

📅 2026-05-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the “fragile safety” of language models, which mechanically adhere to original safety rules even when contextual shifts invert the safety implications of their actions. To systematically evaluate robustness in dynamic scenarios, we introduce a context-flipping assessment framework that constructs paired examples with reversed safety outcomes. Our analysis reveals, for the first time, a substantial gap—averaging 17.4 percentage points—between models’ safety reasoning and commonsense understanding, demonstrating that this fragility stems from insufficient policy coverage rather than misinterpretation. To mitigate this, we propose a state-aware verification mechanism that replaces conventional action-level safeguards. Evaluated on the PacifAIst benchmark and catastrophic consequence probes, our approach achieves 100% risk detection with zero false positives, whereas existing safeguards completely fail.
📝 Abstract
Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls using paired variants where the nominally safe action produces harm. Three findings emerge. First, brittle safety is safety-specific: all 12 models exhibit a safety-commonsense gap (mean +17.4 pp). Baseline accuracy fails to predict brittleness: among models above 90% baseline accuracy, brittleness rates range from 13.7% to 90.0%. Second, failures stem from policy override rather than miscomprehension: despite acknowledging the context change in every case, models persist via three distinct mechanisms that vary by update type and model family. Third, on a hand-audited probe of catastrophic consequence-flip scenarios, standard action-level guardrails catch none, while a state-aware validator catches all without false alarms on correct interventions. This indicates that action-level content moderation is systematically blind to consequence-flips, motivating state-aware architectural alternatives. We release our protocol, perturbed benchmarks, and deployment probe.
Problem

Research questions and friction points this paper is trying to address.

brittle safety
context-flip
aligned language models
safety benchmark
consequence-flip
Innovation

Methods, ideas, or system contributions that make the work stand out.

brittle safety
context-flip evaluation
state-aware validation
safety-commonsense gap
consequence-flip
🔎 Similar Papers
2024-06-20arXiv.orgCitations: 26