🤖 AI Summary
This study investigates the security implications of embedding drift induced by model updates. Even small perturbations (magnitude as low as 0.02) to the embeddings of instruction-tuned models can catastrophically degrade frozen-embedding safety classifiers, reducing ROC-AUC from 85% to near-random levels (50%). Worse, 72% of these failures occur with high confidence, making them silent and therefore invisible to standard monitoring. The work also shows that instruction tuning reduces inter-class separability by 20% relative to the base model, so aligned models are paradoxically harder to safeguard. Through normalized perturbation analysis, confidence profiling, and separability metrics, the study identifies a subtle but severe failure mode and offers guidance for robust safety monitoring as language models evolve.
📝 Abstract
Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, under the assumption that representations remain stable across model updates. We systematically test this assumption and find that it fails: normalized perturbations of magnitude $\sigma = 0.02$ (corresponding to $\approx 1^\circ$ of angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence drops by only $14\%$, producing dangerous silent failures in which $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
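To make the abstract's geometry concrete, the claim that a perturbation of magnitude $\sigma = 0.02$ corresponds to roughly $1^\circ$ of angular drift can be checked with a small simulation. This is a sketch of one plausible reading of the setup, not the authors' released code: we assume unit-norm embeddings, isotropic Gaussian noise directions rescaled to norm $\sigma$, and a hypothetical embedding dimension of 768. Since $\arctan(0.02) \approx 1.15^\circ$, the measured drift should land near that value.

```python
import numpy as np

# Reconstruction of the perturbation setup described in the abstract
# (assumptions: unit-norm embeddings, isotropic noise direction,
# hypothetical dimension d=768 -- none of these are confirmed by the paper).
rng = np.random.default_rng(0)
d = 768          # assumed embedding dimension
sigma = 0.02     # perturbation magnitude from the abstract

def angular_drift_deg(e: np.ndarray, sigma: float, rng) -> float:
    """Angle (degrees) between a unit embedding and its perturbed copy."""
    delta = rng.standard_normal(e.shape)
    delta *= sigma / np.linalg.norm(delta)   # scale noise to norm sigma
    e_pert = e + delta
    e_pert /= np.linalg.norm(e_pert)         # project back onto the unit sphere
    return float(np.degrees(np.arccos(np.clip(e @ e_pert, -1.0, 1.0))))

angles = []
for _ in range(1000):
    e = rng.standard_normal(d)
    e /= np.linalg.norm(e)                   # random unit embedding
    angles.append(angular_drift_deg(e, sigma, rng))

mean_angle = float(np.mean(angles))
print(f"mean angular drift: {mean_angle:.2f} deg")
```

In high dimensions the noise vector is nearly orthogonal to the embedding, so the mean drift concentrates around $\arctan(\sigma)$, consistent with the abstract's "$\approx 1^\circ$" figure.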