UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pairwise evaluation of large language models (LLMs) suffers from preference bias, which causes inconsistent rankings across judges. The paper first empirically demonstrates that this bias is significant and heterogeneous across model evaluations, then proposes UDA, an unsupervised debiasing alignment framework whose consensus-alignment mechanism minimizes the dispersion among judges' Elo trajectories, enabling dynamic, annotation-free calibration of multi-judge scores. For each comparison, a compact neural network adaptively modulates the Elo K-factor and the win-probability estimate. The method reduces inter-judge rating standard deviation by up to 63.4%, improves average correlation with human judgments by 24.7%, and notably raises the reliability of low-performing judges, promoting evaluation fairness. The core innovation is formalizing judge consistency as a trajectory-alignment problem and establishing a theoretically grounded, unsupervised debiasing paradigm.

📝 Abstract
Pairwise evaluation of Large Language Models (LLMs) is a common paradigm, but it is prone to preference bias, where judges systematically favor certain outputs, such as their own. This bias leads to inconsistent and skewed rankings across different judges. To address this, we first empirically demonstrate significant and heterogeneous biases in cross-model evaluations. We then propose UDA (Unsupervised Debiasing Alignment), a framework that reduces inter-judge disagreement by dynamically adjusting the Elo rating system. For each pairwise comparison, a compact neural network learns to adaptively set the K-factor and refine win probabilities. Crucially, UDA operates in a fully unsupervised manner, guided solely by the objective of minimizing the dispersion among the Elo trajectories of all judges. This forces an alignment towards a collective consensus, which serves as an unsupervised proxy for a more stable and reproducible evaluation. In addition, we provide theoretical motivation demonstrating how alignment towards a consensus can reduce aggregate system bias. Experiments show that UDA significantly reduces the inter-judge rating standard deviation by up to 63.4% and improves the average correlation with human judgments by 24.7%. Notably, UDA elevates the performance of poorly performing judges to achieve parity with high-quality ones, fostering a more robust and reliable evaluation ecosystem. Code and data are available at https://anonymous.4open.science/r/62AB93CD-23B4.
Problem

Research questions and friction points this paper is trying to address.

Addresses preference bias in pairwise LLM evaluations
Reduces inter-judge disagreement through dynamic Elo adjustment
Achieves unsupervised alignment towards collective consensus ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised framework dynamically adjusts Elo rating system
Compact neural network adaptively sets the K-factor and refines win probabilities
Minimizes Elo-trajectory dispersion to force consensus alignment
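The Elo machinery the framework builds on can be sketched as follows. This is a minimal illustration of the standard Elo update and of the dispersion quantity the abstract describes; in UDA itself, the K-factor and win probability are produced by a learned network rather than the fixed forms shown here, and all function names are illustrative.

```python
import numpy as np

def elo_expected(r_a, r_b, scale=400.0):
    """Standard Elo win probability of player A over player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def elo_update(ratings, a, b, outcome, k):
    """One pairwise Elo update; `outcome` is 1.0 if A wins, 0.0 if B wins.
    UDA would replace the fixed `k` and `elo_expected` with learned values."""
    p = elo_expected(ratings[a], ratings[b])
    delta = k * (outcome - p)
    ratings[a] += delta
    ratings[b] -= delta
    return ratings

def trajectory_dispersion(trajectories):
    """Mean per-model variance of ratings across judges: the dispersion
    that UDA's unsupervised objective drives toward zero.
    `trajectories` is a (num_judges, num_models) rating matrix."""
    return float(np.mean(np.var(trajectories, axis=0)))
```

For example, with two equal ratings of 1000 and K = 32, a win moves the pair to 1016 and 984; if every judge produced identical rating vectors, `trajectory_dispersion` would return 0, which is the consensus state the alignment objective targets.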
Authors

Yang Zhang
Beijing Knowledge Atlas Technology Joint Stock

Cunxiang Wang
Tsinghua University; ZhipuAI

Lindong Wu
Beijing Knowledge Atlas Technology Joint Stock

Wenbo Yu
Beijing Knowledge Atlas Technology Joint Stock

Yidong Wang
Beijing Knowledge Atlas Technology Joint Stock

Guangsheng Bao
Ph.D. Candidate, Westlake University & Zhejiang University

Jie Tang
UW Madison