🤖 AI Summary
Guard models exhibit high sensitivity to meaning-preserving input rewrites, revealing a fundamental deficiency in semantic robustness. This paper proposes a self-supervised training framework that treats prediction consistency as a first-class objective: it constructs label-free training signals from paraphrase ensembles and computes robust consistency targets with a skew-aware aggregation mechanism, after finding that standard aggregators such as the mean and median can degrade safety. Crucially, the method builds semantic stability directly into the training objective rather than relying on superficial linguistic features. Evaluated on six open-source guard models, the approach reduces semantic variability across paraphrases by ~58% on average, improves benchmark accuracy by ~2.5%, and improves calibration by up to 40%, revealing a bidirectional relationship between calibration and consistency. It also generalizes to unseen stylistic variations, yielding more reliable and safer guard models.
📝 Abstract
Guard models are a critical component of LLM safety, but their sensitivity to superficial linguistic variations remains a key vulnerability. We show that even meaning-preserving paraphrases can cause large fluctuations in safety scores, revealing a lack of semantic grounding. To address this, we introduce a practical, self-supervised framework for improving the semantic robustness of guard models. Our method leverages paraphrase sets to enforce prediction consistency using a novel, skew-aware aggregation strategy for robust target computation. Notably, we find that standard aggregation methods like mean and median can degrade safety, underscoring the need for skew-aware alternatives. We analyze six open-source guard models and show that our approach reduces semantic variability across paraphrases by ~58%, improves benchmark accuracy by ~2.5% on average, and generalizes to unseen stylistic variations. Intriguingly, we discover a bidirectional relationship between model calibration and consistency: our robustness training improves calibration by up to 40%, revealing a fundamental connection between these properties. These results highlight the value of treating semantic consistency as a first-class training objective and provide a scalable recipe for building more reliable guard models.
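The abstract names a "skew-aware aggregation strategy" for computing robust consistency targets from paraphrase score sets, but does not spell it out. The sketch below is one plausible, illustrative interpretation (the decision rule, threshold, and quantiles are assumptions, not the paper's method): when the paraphrase scores are roughly symmetric, the mean is used; when the distribution is skewed, the target shifts toward the higher-risk tail instead of the mean or median, which the paper reports can degrade safety.

```python
import numpy as np

def skew_aware_aggregate(scores, skew_threshold=0.5):
    """Aggregate safety scores from a paraphrase ensemble into one target.

    Illustrative sketch only: the rule below (Pearson skewness test plus a
    conservative quantile) is an assumption about what "skew-aware" could
    mean, not the paper's actual algorithm. Higher score = more unsafe.
    """
    scores = np.asarray(scores, dtype=float)
    mean, median, std = scores.mean(), np.median(scores), scores.std()
    if std < 1e-8:
        return float(mean)  # all paraphrases agree; nothing to aggregate
    # Pearson's second skewness coefficient: sign tells which tail is heavy.
    skew = 3.0 * (mean - median) / std
    if abs(skew) <= skew_threshold:
        return float(mean)  # roughly symmetric: plain mean is a safe target
    # Skewed distribution: pick a conservative quantile on the risky side
    # rather than letting mean/median wash out the unsafe tail.
    return float(np.quantile(scores, 0.75 if skew > 0 else 0.25))

# Hypothetical usage: one outlier paraphrase scored as unsafe pulls the
# target above the median, whereas the median alone would ignore it.
target = skew_aware_aggregate([0.10, 0.11, 0.12, 0.90])
```

Such a target could then anchor a consistency loss, e.g. penalizing each paraphrase's score for deviating from it, which matches the abstract's framing of enforcing prediction consistency across paraphrase sets.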