🤖 AI Summary
This study investigates the security implications of embedding drift induced by model updates. Even small perturbations (magnitude as low as 0.02) to the embeddings of instruction-tuned models can catastrophically degrade frozen-embedding safety classifiers, reducing ROC-AUC from 85% to near-random levels (50%). Worse, 72% of these failures occur with high confidence, making them silent and therefore invisible to standard monitoring. The work also shows that instruction tuning reduces inter-class separability by 20% relative to the base model, so aligned models are paradoxically harder to safeguard. Through normalized perturbation analysis, confidence profiling, and separability metrics, the study identifies a subtle but severe failure mode and offers guidance for robust safety monitoring as language models evolve.
📝 Abstract
Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, under the assumption that representations remain stable across model updates. We systematically test this assumption and find that it fails: normalized perturbations of magnitude $\sigma = 0.02$ (corresponding to $\approx 1^\circ$ of angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence drops by only $14\%$, producing dangerous silent failures in which $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
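To make the abstract's geometry concrete, the claim that a perturbation of magnitude $\sigma = 0.02$ corresponds to roughly $1^\circ$ of angular drift can be checked with a small simulation. This is a sketch of one plausible reading of the setup, not the authors' released code: we assume unit-norm embeddings, isotropic Gaussian noise directions rescaled to norm $\sigma$, and a hypothetical embedding dimension of 768. Since $\arctan(0.02) \approx 1.15^\circ$, the measured drift should land near that value.

```python
import numpy as np

# Reconstruction of the perturbation setup described in the abstract
# (assumptions: unit-norm embeddings, isotropic noise direction,
# hypothetical dimension d=768 -- none of these are confirmed by the paper).
rng = np.random.default_rng(0)
d = 768          # assumed embedding dimension
sigma = 0.02     # perturbation magnitude from the abstract

def angular_drift_deg(e: np.ndarray, sigma: float, rng) -> float:
    """Angle (degrees) between a unit embedding and its perturbed copy."""
    delta = rng.standard_normal(e.shape)
    delta *= sigma / np.linalg.norm(delta)   # scale noise to norm sigma
    e_pert = e + delta
    e_pert /= np.linalg.norm(e_pert)         # project back onto the unit sphere
    return float(np.degrees(np.arccos(np.clip(e @ e_pert, -1.0, 1.0))))

angles = []
for _ in range(1000):
    e = rng.standard_normal(d)
    e /= np.linalg.norm(e)                   # random unit embedding
    angles.append(angular_drift_deg(e, sigma, rng))

mean_angle = float(np.mean(angles))
print(f"mean angular drift: {mean_angle:.2f} deg")
```

In high dimensions the noise vector is nearly orthogonal to the embedding, so the mean drift concentrates around $\arctan(\sigma)$, consistent with the abstract's "$\approx 1^\circ$" figure.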