Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a novel “representation-as-evaluator” paradigm to address the limitations of conventional large language models (LLMs) as evaluators—namely high computational cost, poor interpretability, and sensitivity to prompt variations. Instead of relying on generated outputs for evaluation, the approach probes semantic signals embedded in the hidden states of small-to-medium-sized language models. Grounded in the hypothesis of asymmetric semantic capacity across model layers, the authors introduce the INSPECTOR framework, which employs lightweight probes to directly predict fine-grained evaluation scores from intermediate representations without requiring autoregressive decoding. Experiments demonstrate that this method significantly outperforms existing small-model evaluation techniques on reasoning benchmarks such as GSM8K, MATH, and GPQA, achieving performance comparable to full-scale LLM evaluators while offering superior efficiency, robustness, and interpretability.

📝 Abstract
Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite their weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation need not rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small-model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.
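The abstract describes probing frozen hidden states with lightweight predictors instead of decoding. The paper's exact INSPECTOR probe architecture is not given here, so the following is only a minimal sketch of the general idea, assuming a linear (ridge-regression) probe trained on intermediate-layer representations to predict a scalar evaluation score; the hidden states and scores below are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen hidden states from an intermediate layer of a small LM
# (in practice d is the model's hidden size, extracted in a single forward
# pass with no autoregressive decoding).
n, d = 200, 64
H = rng.normal(size=(n, d))                      # representations, one per response
w_true = rng.normal(size=d)                      # hypothetical latent scoring direction
scores = H @ w_true + 0.1 * rng.normal(size=n)   # synthetic aspect-level quality scores

# A lightweight probe: closed-form ridge regression,
# w = (H^T H + lam * I)^{-1} H^T y
lam = 1e-2
w = np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ scores)

pred = H @ w                                     # decoding-free score predictions
mse = float(np.mean((pred - scores) ** 2))
```

In this setup the probe adds only a d-dimensional weight vector per evaluation aspect, which is what makes representation-based judging cheap relative to prompting a large generative judge.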
Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-Judge
evaluation
semantic capacity
small language models
representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Representation-as-a-Judge
Semantic Capacity Asymmetry
Probing-based Evaluation
Small Language Models
Decoding-free Evaluation