Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

📅 2026-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies and formalizes a previously under-recognized vulnerability in large language model (LLM) evaluation: "rubric-induced preference drift" (RIPD), wherein seemingly innocuous modifications to natural-language scoring rubrics systematically skew judgments in target domains without triggering standard benchmark alarms. The authors propose a rubric-based preference attack framework that combines controlled rubric editing, domain-specific evaluation, and downstream policy fine-tuning to expose the manipulability of this high-level alignment interface. Experimental results show that such attacks can degrade judgment accuracy by up to 9.5% on helpfulness and 27.9% on harmlessness metrics. Critically, the induced biases propagate through preference data into aligned models, leading to persistent behavioral deviations that compromise alignment integrity.
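
The drift measurement implied by these results can be illustrated with a minimal sketch. The `judge` callable, the pairwise data format, and the field names below are hypothetical placeholders rather than the authors' actual interface; the idea is simply to compare a judge's agreement with a fixed trusted reference under the original rubric versus an edited, benchmark-compliant one.

```python
from typing import Callable, Dict, List

def agreement_with_reference(
    judge: Callable[[str, str, str, str], str],  # hypothetical: returns "A" or "B"
    rubric: str,
    pairs: List[Dict],  # each: {"prompt", "response_a", "response_b", "reference"}
) -> float:
    """Fraction of pairwise verdicts that match the trusted reference label."""
    hits = 0
    for ex in pairs:
        verdict = judge(rubric, ex["prompt"], ex["response_a"], ex["response_b"])
        hits += int(verdict == ex["reference"])
    return hits / len(pairs)

def preference_drift(judge, baseline_rubric, edited_rubric, target_pairs):
    """Drop in target-domain agreement attributable to the rubric edit alone."""
    base_acc = agreement_with_reference(judge, baseline_rubric, target_pairs)
    edited_acc = agreement_with_reference(judge, edited_rubric, target_pairs)
    return base_acc - edited_acc  # positive: the edit steers the judge away from the reference
```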

📝 Abstract
Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies. This leads to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.
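
As a rough illustration of the propagation path the abstract describes, the hypothetical sketch below shows how verdicts from a rubric-guided judge become (chosen, rejected) preference pairs for downstream post-training. The `judge` signature and data fields are assumptions, not the paper's code; the point is that any drift the rubric edit induces in the judge is carried directly into the training data.

```python
from typing import Callable, Dict, List

def build_preference_data(
    judge: Callable[[str, str, str, str], str],  # hypothetical: returns "A" or "B"
    rubric: str,            # possibly an edited rubric that still passes benchmark validation
    candidates: List[Dict], # each: {"prompt", "response_a", "response_b"}
) -> List[Dict]:
    """Label candidate pairs with the rubric-guided judge.

    Whatever preference drift the rubric induces in the judge is baked
    into the resulting (chosen, rejected) pairs.
    """
    labeled = []
    for ex in candidates:
        verdict = judge(rubric, ex["prompt"], ex["response_a"], ex["response_b"])
        chosen, rejected = (
            (ex["response_a"], ex["response_b"])
            if verdict == "A"
            else (ex["response_b"], ex["response_a"])
        )
        labeled.append({"prompt": ex["prompt"], "chosen": chosen, "rejected": rejected})
    return labeled  # e.g., handed to a DPO-style trainer, internalizing the judge's bias
```
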
Problem

Research questions and friction points this paper is trying to address.

Rubric-Induced Preference Drift
LLM Judges
Preference Attacks
Alignment Pipelines
Evaluation Vulnerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-Induced Preference Drift
LLM Judges
Preference Attacks
Alignment Vulnerability
Evaluation Rubrics