When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

career value

171K/year

🤖 AI Summary

This work addresses the “fragile safety” of language models, which mechanically adhere to original safety rules even when contextual shifts invert the safety implications of their actions. To systematically evaluate robustness in dynamic scenarios, we introduce a context-flipping assessment framework that constructs paired examples with reversed safety outcomes. Our analysis reveals, for the first time, a substantial gap—averaging 17.4 percentage points—between models’ safety reasoning and commonsense understanding, demonstrating that this fragility stems from insufficient policy coverage rather than misinterpretation. To mitigate this, we propose a state-aware verification mechanism that replaces conventional action-level safeguards. Evaluated on the PacifAIst benchmark and catastrophic consequence probes, our approach achieves 100% risk detection with zero false positives, whereas existing safeguards completely fail.

📝 Abstract

Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls using paired variants where the nominally safe action produces harm. Three findings emerge. First, brittle safety is safety-specific: all 12 models exhibit a safety-commonsense gap (mean +17.4 pp). Baseline accuracy fails to predict brittleness: among models above 90% baseline accuracy, brittleness rates range from 13.7% to 90.0%. Second, failures stem from policy override rather than miscomprehension: despite acknowledging the context change in every case, models persist via three distinct mechanisms that vary by update type and model family. Third, on a hand-audited probe of catastrophic consequence-flip scenarios, standard action-level guardrails catch none, while a state-aware validator catches all without false alarms on correct interventions. This indicates that action-level content moderation is systematically blind to consequence-flips, motivating state-aware architectural alternatives. We release our protocol, perturbed benchmarks, and deployment probe.

Problem

Research questions and friction points this paper is trying to address.

brittle safety

context-flip

aligned language models

safety benchmark

consequence-flip

Innovation

Methods, ideas, or system contributions that make the work stand out.

brittle safety

context-flip evaluation

state-aware validation