Judge Circuits

📅 2026-05-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses a systematic inconsistency in LLM-as-a-judge setups: the same model gives different judgments when the requested output format changes, for example a 1–5 rating versus a True/False label. Going beyond input-output diagnoses, it traces the inconsistency to the circuit level, attributing it to a shared, sparse latent evaluation subgraph whose signal passes through fragile, format-specific output branches. Using Position-aware Edge Attribution Patching (PEAP) and zero-ablation, the authors identify a generalized Latent Evaluator subgraph in the mid-to-late MLP layers and show that abstract judging is structurally decoupled from output formatting. Experiments on Gemma-3, Qwen2.5, and Llama-3 isolate a format-independent preference signal and indicate that cross-format discrepancies partly reflect the geometry of the format-translating components rather than differences in underlying judgment quality.
📝 Abstract
LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.
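The trunk-and-branches account above can be made concrete with a toy sketch. This is not the authors' code, models, or PEAP procedure: every function name, the bin layout, and the 0.7 threshold are hypothetical choices made for illustration. It shows one continuous judgment signal read out by two format-specific heads that can disagree, and how zero-ablating the shared component collapses both outputs to a constant.

```python
# Toy illustration only: all names and thresholds are hypothetical,
# not taken from the paper or from any real model.

def latent_evaluator(quality, ablate=False):
    """Shared trunk: return a continuous judgment signal in [0, 1]."""
    if ablate:
        # Zero-ablation: replace the component's output with zeros.
        return 0.0
    return max(0.0, min(1.0, quality))

def rating_head(signal):
    """Format branch A: map the signal to a 1-5 rating using equal-width bins."""
    return min(5, int(signal * 5) + 1)

def boolean_head(signal, threshold=0.7):
    """Format branch B: map the same signal to True/False with its own threshold."""
    return signal >= threshold

def judge(quality, ablate=False):
    """Run the shared trunk once, then read it out in both formats."""
    signal = latent_evaluator(quality, ablate)
    return rating_head(signal), boolean_head(signal)

if __name__ == "__main__":
    for q in (0.1, 0.65, 0.9):
        print(q, "intact:", judge(q), "ablated:", judge(q, ablate=True))
```

In this sketch an input with signal 0.65 receives a favorable rating of 4 but a False label, even though both heads read the same underlying signal. The disagreement comes entirely from where each head places its decision boundaries, which is the sense in which a cross-format comparison can measure the formatter rather than the judgment.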
Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-judge
format-induced inconsistency
latent evaluator
output formatting
judgment reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Evaluator
format-induced inconsistency
Position-aware Edge Attribution Patching
mechanistic interpretability
modular judgment