Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit "self-preference bias" when employed as automatic evaluators—systematically overrating their own outputs—undermining fairness and reliability in preference alignment and model routing. This work proposes a training-free, inference-time intervention that applies steering vectors to mitigate this bias. We are the first to decompose self-preference bias into *reasonable* and *unreasonable* components, constructing a dedicated annotation dataset for fine-grained analysis. Our empirical study reveals the bias's multidimensional, nonlinear nature and the strong context dependence of steering-vector efficacy. Comparing Contrastive Activation Addition (CAA) with optimization-driven steering-vector generation, we achieve up to a 97% reduction in unreasonable self-preference on preference evaluation tasks—substantially outperforming prompt engineering and direct preference optimization baselines. We further identify intrinsic challenges in calibrating reasonable self-preference, establishing a novel paradigm for trustworthy automatic evaluation.

📝 Abstract
Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from "self-preference bias": a tendency to favor their own outputs over those of other models. This bias undermines fairness and reliability in evaluation pipelines, particularly for tasks like preference tuning and model routing. We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining. We introduce a curated dataset that separates self-preference into justified and unjustified examples, and we construct steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach. Our results show that steering vectors can reduce unjustified self-preference bias by up to 97%, substantially outperforming prompting and direct preference optimization baselines. Yet steering vectors are unstable on legitimate self-preference and unbiased agreement, implying that self-preference spans multiple or nonlinear directions. This underscores both their promise and their limits as safeguards for LLM-as-judge pipelines, and motivates more robust interventions.
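The core mechanics of CAA as described in the abstract can be sketched as follows: average the residual-stream activations over contrastive pairs (biased vs. unbiased judgments), take the difference as the steering vector, and add a scaled copy of it to activations at inference. This is a minimal illustrative sketch, not the paper's implementation; the function names, the NumPy setup, and the scaling parameter `alpha` are assumptions for illustration.

```python
import numpy as np

def contrastive_steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """CAA-style vector: mean activation of the 'positive' (e.g. unbiased)
    examples minus mean activation of the 'negative' (e.g. self-preferring)
    examples, both of shape (n_examples, hidden_dim)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(activations: np.ndarray, vector: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Inference-time intervention: add the scaled steering vector to the
    activations at a chosen layer (no weights are updated)."""
    return activations + alpha * vector

# Toy usage with random stand-in activations (hidden_dim = 8).
rng = np.random.default_rng(0)
vec = contrastive_steering_vector(rng.normal(size=(16, 8)), rng.normal(size=(16, 8)))
steered = apply_steering(rng.normal(size=(8,)), vec, alpha=2.0)
```

In practice the activations would be captured from a specific transformer layer via forward hooks, and `alpha` tuned on held-out evaluation pairs.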
Problem

Research questions and friction points this paper is trying to address.

Mitigating self-preference bias in LLM evaluators
Reducing unfairness in automated evaluation pipelines
Using steering vectors without model retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight steering vectors mitigate bias
Contrastive Activation Addition reduces self-preference
Optimization-based approach outperforms baseline methods
Jou Barzdukas
Department of Computer Science, University of Virginia
Matthew Nguyen
Department of Computer Science, University of Virginia
Matthew Bozoukov
Department of Computer Science, University of California, San Diego
Simon Fu
School of Computer Science, Carnegie Mellon University
Dani Roytburg
School of Computer Science, Carnegie Mellon University
Narmeen Oozeer
Research Engineer, Martian Learning
mathematics · deep learning · interpretability