🤖 AI Summary
This work addresses the instability, poor calibration, coarse resolution, and non-determinism of large language models (LLMs) used as reference-free evaluators. It proposes an evaluation framework grounded in the models' internal latent representations and systematically characterizes two failure modes of LLM scoring: compression of scores near the top of the scale and instability under sampling. By combining probability-weighted integer scoring, verifier-style "yes"/"no" judgments, and linear probes trained on hidden-layer activations, the framework extracts robust, interpretable, fine-grained scores directly from model-internal signals. Evaluated across multiple single-response and pairwise benchmarks, the approach matches or surpasses conventional prompting-based judging, and it improves accuracy and ranking consistency in downstream tasks, including Best-of-N selection and multi-teacher distillation, demonstrating strong generalization. These results point toward more reliable reference-free automated evaluation of LLM-generated outputs.
📝 Abstract
How reliable are single-response LLM-as-a-judge ratings without references, and can we obtain fine-grained, deterministic scores in this setting? We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses and show two systematic issues: scores are unstable under sampling and poorly calibrated, leading to compression near the top of the scale and frequent ties. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals: (i) probability-weighted scores over integer ratings, (ii) verifier-style probabilities of "yes", and (iii) linear probes trained on model activations at the rating position. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting, with consistent gains on pairwise accuracy and listwise ranking relevant to Best-of-N selection. Probability-weighted scores achieve the strongest single-rating correlations, while probes recover useful signals when output logits are miscalibrated. These results indicate that latent information provides deterministic and more discriminative signals for reference-free evaluation, and can improve selection and training approaches like Best-of-N, multi-teacher distillation, and routing.
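The first two latent signals described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the renormalization over rating tokens, and the example log-probabilities are our own assumptions about how next-token log-probabilities from a judge model would be post-processed.

```python
import math

def weighted_score(rating_logprobs):
    """Collapse a judge's next-token distribution over integer rating tokens
    (e.g. "1".."5") into one scalar: the probability-weighted mean
    sum_k k * P(k), renormalized over the rating tokens only.
    `rating_logprobs` maps each integer rating to its log-probability."""
    probs = {k: math.exp(lp) for k, lp in rating_logprobs.items()}
    z = sum(probs.values())
    return sum(k * p for k, p in probs.items()) / z

def yes_probability(logprob_yes, logprob_no):
    """Verifier-style score: P("yes") renormalized against P("no")."""
    p_yes, p_no = math.exp(logprob_yes), math.exp(logprob_no)
    return p_yes / (p_yes + p_no)

# Hypothetical log-probabilities for rating tokens 1..5 at the rating position.
logprobs = {1: -6.0, 2: -4.0, 3: -2.0, 4: -0.5, 5: -1.5}
score = weighted_score(logprobs)  # a fine-grained scalar between the integer ratings
```

Unlike sampling or taking the argmax rating, both scores are deterministic functions of the model's logits, which is what removes ties and run-to-run variance in the single-rating setting.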