Guarding the Meaning: Self-Supervised Training for Semantic Robustness in Guard Models

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Guard models exhibit high sensitivity to meaning-preserving input rewrites, revealing a fundamental deficiency in semantic robustness. This paper proposes a self-supervised training framework that makes prediction consistency across paraphrases a core training objective. It constructs label-free supervision signals from rewrite ensembles and introduces a skew-aware aggregation mechanism to compute robust consistency targets, after finding that standard mean and median aggregation can degrade safety. By building semantic stability directly into the training objective rather than relying on superficial features, the approach reduces semantic variability by 58% on average across six open-source guard models, improves accuracy by 2.5 percentage points, and improves calibration by up to 40%, revealing a bidirectional relationship between calibration and consistency. It also generalizes to unseen linguistic style variations, improving model reliability and safety.

📝 Abstract
Guard models are a critical component of LLM safety, but their sensitivity to superficial linguistic variations remains a key vulnerability. We show that even meaning-preserving paraphrases can cause large fluctuations in safety scores, revealing a lack of semantic grounding. To address this, we introduce a practical, self-supervised framework for improving the semantic robustness of guard models. Our method leverages paraphrase sets to enforce prediction consistency using a novel, skew-aware aggregation strategy for robust target computation. Notably, we find that standard aggregation methods like mean and median can degrade safety, underscoring the need for skew-aware alternatives. We analyze six open-source guard models and show that our approach reduces semantic variability across paraphrases by ~58%, improves benchmark accuracy by ~2.5% on average, and generalizes to unseen stylistic variations. Intriguingly, we discover a bidirectional relationship between model calibration and consistency: our robustness training improves calibration by up to 40%, revealing a fundamental connection between these properties. These results highlight the value of treating semantic consistency as a first-class training objective and provide a scalable recipe for building more reliable guard models.
Problem

Research questions and friction points this paper is trying to address.

Guard models are sensitive to superficial linguistic variations even when meaning is preserved
Standard aggregation methods (e.g., mean and median) can degrade safety when used to compute consistency targets
Existing models lack robustness against paraphrases and stylistic variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised framework improves semantic robustness without labels
Skew-aware aggregation yields robust targets for enforcing prediction consistency
Paraphrase-set training reduces semantic variability by ~58%
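The innovations above combine into a simple training objective: each paraphrase's guard score is pulled toward a shared aggregated target, so the model learns to give the same verdict regardless of surface form. The loss below is a minimal sketch assuming a squared-error consistency penalty; the paper's actual loss function is not specified here, and `aggregate` stands in for whatever skew-aware rule produces the target.

```python
def consistency_loss(scores, aggregate):
    """Label-free consistency objective over one paraphrase set.

    scores: guard-model safety scores for paraphrases of one prompt.
    aggregate: callable mapping the score list to a robust target
               (hypothetically, a skew-aware aggregator).
    Returns the mean squared deviation of each score from the target,
    which is zero exactly when all paraphrases receive the same score.
    """
    target = aggregate(scores)
    return sum((s - target) ** 2 for s in scores) / len(scores)
```

Because the target is computed from the model's own outputs on rewrites, no human labels are needed, which is what makes the recipe scalable in the sense the abstract claims.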