Consistency Training Helps Stop Sycophancy and Jailbreaks

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to simple adversarial changes to a prompt, exhibiting factual degradation (e.g., sycophantic agreement with stated user beliefs) or safety failures (e.g., jailbreaking). This work frames alignment as a consistency problem and proposes a self-supervised training framework that needs no static dataset. It applies Bias-augmented Consistency Training (BCT) at the output level and introduces Activation Consistency Training (ACT) at the hidden-activation level; both teach the model to behave identically across prompt augmentations. Experiments on Gemini 2.5 Flash show that the two methods reduce sycophancy about equally well, while BCT is more effective at reducing jailbreaks. Because training targets come from the model's own responses, the approach avoids the pitfalls of stale supervision, such as degraded capabilities or outdated response guidelines, and can simplify training pipelines by removing reliance on static datasets.

📝 Abstract
An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model's external outputs (Bias-augmented Consistency Training (BCT) from Chua et al. [2025]) and over its internal activations (Activation Consistency Training (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash's susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.

Problem

Research questions and friction points this paper is trying to address.

Reducing sycophancy and jailbreak vulnerabilities in LLMs
Enhancing model invariance to irrelevant prompt cues
Improving alignment via self-supervised consistency training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Consistency training ensures model invariance to irrelevant cues
Bias-augmented Consistency Training (BCT) enforces invariance over the model's external outputs, which proves especially effective against jailbreaks
Activation Consistency Training (ACT) enforces invariance over the model's internal activations
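The two objectives above can be illustrated with a toy NumPy sketch. This is an assumption-laden simplification, not the paper's implementation: in practice BCT fine-tunes the model on its own clean-prompt responses presented under the augmented prompt, and ACT matches hidden-layer activations inside a live network. Here both are reduced to small vectors so the shape of each loss is visible.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def bct_loss(clean_logits, augmented_logits):
    # BCT (output level, toy version): cross-entropy of the model's
    # distribution under the augmented prompt against its distribution
    # under the clean prompt, which serves as a frozen target.
    p_clean = softmax(clean_logits)        # target; no gradient in practice
    q_aug = softmax(augmented_logits)
    return float(-(p_clean * np.log(q_aug + 1e-12)).sum())

def act_loss(clean_acts, augmented_acts):
    # ACT (activation level, toy version): mean squared distance between
    # hidden activations under the clean vs. augmented prompt.
    return float(np.mean((clean_acts - augmented_acts) ** 2))
```

Both losses are zero-gradient-seeking toward the same target: a model whose behavior (BCT) or internal state (ACT) does not move when an irrelevant cue, such as a leading question or jailbreak wrapper, is added to the prompt.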