Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional label aggregation methods, such as Dawid-Skene, when large language models (LLMs) serve as annotators. The core issue is the violation of the conditional independence assumption among annotators, often induced by shared data, architectures, or prompts, which leads to biased posterior estimates and erroneous judgments. To tackle this, the paper introduces an Ising graphical model into the LLM adjudication setting, proposing a dependence-aware aggregation framework that explicitly captures both class-dependent and class-independent couplings among annotators. By integrating Bayesian log-odds inference with a correlation-corrected weighted voting mechanism, the approach effectively mitigates excess risk. Theoretical analysis demonstrates that neglecting such dependencies yields suboptimal performance in both finite-sample and asymptotic regimes. Empirical evaluations on three real-world datasets show that the method consistently outperforms classical baselines, underscoring the efficacy and necessity of modeling annotator dependencies.
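As a rough illustration of the correlation-corrected weighted voting idea described above, the sketch below contrasts a Dawid-Skene-style log-odds vote (which assumes conditionally independent annotators) with a variant whose weights are shrunk by an equicorrelation factor. The accuracies `acc`, correlation `rho`, and the effective-sample-size-style shrinkage are illustrative assumptions, not the paper's actual estimator.

```python
import numpy as np

def naive_log_odds(votes, acc, prior=0.5):
    """Bayes log-odds assuming conditionally independent annotators.
    votes: array of {0,1}; acc: per-annotator accuracy P(vote = Y | Y)."""
    s = 2 * np.asarray(votes) - 1            # map {0,1} -> {-1,+1}
    w = np.log(acc / (1 - acc))              # Dawid-Skene-style weights
    return np.log(prior / (1 - prior)) + float(np.sum(w * s))

def corrected_log_odds(votes, acc, rho, prior=0.5):
    """Same linear vote, but weights shrunk for equicorrelated annotators
    (a heuristic correction; the paper derives its weights from
    class-independent Ising couplings)."""
    K = len(votes)
    shrink = 1.0 / (1.0 + (K - 1) * rho)     # effective-sample-size style
    s = 2 * np.asarray(votes) - 1
    w = np.log(acc / (1 - acc)) * shrink
    return np.log(prior / (1 - prior)) + float(np.sum(w * s))
```

With five unanimous judges of accuracy 0.8 and correlation 0.6, the naive rule reports the confidence of five independent votes, while the corrected rule tempers it to roughly that of 1.5 independent votes.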

📝 Abstract
Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLMs used as judges. Most classical methods, e.g., Dawid-Skene or (weighted) majority voting, assume annotators are conditionally independent given the true label $Y\in\{0,1\}$, an assumption often violated by LLM judges due to shared data, architectures, prompts, and failure modes. Ignoring such dependencies can yield miscalibrated posteriors and even confidently incorrect predictions. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors. For class-dependent Ising models, the Bayes log-odds is generally quadratic in votes; for class-independent couplings, it reduces to a linear weighted vote with correlation-adjusted parameters. We present finite-$K$ examples showing that methods based on conditional independence can flip the Bayes label despite matching per-annotator marginals. We prove separation results demonstrating that these methods remain strictly suboptimal as the number of judges grows, incurring nonvanishing excess risk under latent factors. Finally, we evaluate the proposed method on three real-world datasets, demonstrating improved performance over the classical baselines.
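To make the abstract's linear-versus-quadratic claim concrete, the sketch below computes the exact Bayes log-odds under a class-dependent Ising model by enumerating all $2^K$ vote patterns (feasible only for small $K$). All parameters here (`h`, `J`, the uniform prior) are hypothetical. When the couplings are class-independent (`J[1]` equal to `J[0]`), the quadratic energy terms cancel in the difference and the rule reduces to a linear weighted vote, as the abstract states.

```python
import itertools
import numpy as np

def ising_log_odds(s, h, J, prior=(0.5, 0.5)):
    """Exact Bayes log-odds log P(Y=1|s)/P(Y=0|s) under a class-dependent
    Ising model P(s|y) ∝ exp(h[y]·s + s'J[y]s/2), by brute-force
    enumeration over all 2^K spin configurations (small K only)."""
    s = np.asarray(s)                        # votes in {-1, +1}^K
    K = len(s)

    def log_partition(y):
        states = np.array(list(itertools.product([-1, 1], repeat=K)))
        e = states @ h[y] + 0.5 * np.einsum('ki,ij,kj->k', states, J[y], states)
        m = e.max()                          # log-sum-exp for stability
        return m + np.log(np.exp(e - m).sum())

    def energy(y):
        return s @ h[y] + 0.5 * s @ J[y] @ s

    return (np.log(prior[1] / prior[0])
            + (energy(1) - log_partition(1))
            - (energy(0) - log_partition(0)))
```

With class-independent couplings and symmetric fields (`h[0] = -h[1]`, uniform prior), flipping every vote negates the log-odds, exactly the behavior of a linear weighted vote; choosing `J[1] != J[0]` breaks this and introduces genuinely quadratic terms.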
Problem

Research questions and friction points this paper is trying to address.

label aggregation
LLM-as-a-Judge
annotator dependence
conditional independence
Ising models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ising model
label aggregation
LLM-as-a-Judge
dependence-aware
latent factors
Krishnakumar Balasubramanian
University of California, Davis
Statistics · Optimization · Machine learning
A. Podkopaev
Amazon Web Services
S. Kasiviswanathan
Amazon Web Services