Becoming Experienced Judges: Selective Test-Time Learning for Evaluators

📅 2025-12-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current LLM-as-a-judge approaches suffer from two key limitations: (1) evaluators process samples in isolation, preventing experience accumulation; and (2) they rely on static prompts, hindering adaptation to sample heterogeneity. This paper proposes the Learning While Evaluating (LWE) framework, which enables evaluators to self-evolve *during inference*: an evolving meta-prompt generates sample-specific evaluation criteria and refines itself through self-generated feedback. Its Selective LWE variant identifies judgment-inconsistent samples via a self-consistency check and updates the meta-prompt *only* on these difficult cases, concentrating computation where it matters most. Crucially, LWE achieves online experience accumulation and personalized assessment without requiring any training or validation data. Evaluated on two pairwise comparison benchmarks, Selective LWE significantly outperforms strong baselines, particularly improving discriminative consistency and robustness on challenging instances.


๐Ÿ“ Abstract
Automatic evaluation with large language models, commonly known as LLM-as-a-judge, is now standard across reasoning and alignment tasks. Despite evaluating many samples in deployment, these evaluators typically (i) treat each case independently, missing the opportunity to accumulate experience, and (ii) rely on a single fixed prompt for all cases, neglecting the need for sample-specific evaluation criteria. We introduce Learning While Evaluating (LWE), a framework that allows evaluators to improve sequentially at inference time without requiring training or validation sets. LWE maintains an evolving meta-prompt that (i) produces sample-specific evaluation instructions and (ii) refines itself through self-generated feedback. Furthermore, we propose Selective LWE, which updates the meta-prompt only on self-inconsistent cases, focusing computation where it matters most. This selective approach retains the benefits of sequential learning while being far more cost-effective. Across two pairwise comparison benchmarks, Selective LWE outperforms strong baselines, empirically demonstrating that evaluators can improve during sequential testing with a simple selective update, learning most from the cases they struggle with.
Problem

Research questions and friction points this paper is trying to address.

LLM evaluators lack sequential learning from past cases
Fixed prompts fail to adapt to sample-specific evaluation needs
Evaluators need cost-effective self-improvement during inference without training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective test-time learning for evaluators
Evolving meta-prompt with self-generated feedback
Updates only on self-inconsistent cases
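The selective update loop behind these points can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `judge` stands in for an LLM evaluator call (here a trivial length heuristic so the example runs), `refine_meta_prompt` stands in for self-generated feedback, and the self-consistency check uses the common order-swap test for pairwise judges.

```python
def judge(meta_prompt, question, resp_a, resp_b):
    """Stand-in evaluator: prefers the longer response (placeholder for an LLM call)."""
    return "A" if len(resp_a) >= len(resp_b) else "B"

def is_self_consistent(meta_prompt, question, resp_a, resp_b):
    """Judge the pair in both presentation orders; the verdicts must agree."""
    v1 = judge(meta_prompt, question, resp_a, resp_b)
    v2 = judge(meta_prompt, question, resp_b, resp_a)  # swapped order
    flipped = "A" if v2 == "B" else "B"                # map swapped verdict back
    return v1 == flipped

def refine_meta_prompt(meta_prompt, question):
    """Stand-in for self-feedback: append a sample-specific criterion."""
    return meta_prompt + f"\n- Pay extra attention to cases like: {question!r}"

def selective_lwe(stream, meta_prompt="Judge which response is better."):
    """Process a stream of (question, resp_a, resp_b) pairs sequentially,
    updating the meta-prompt only on self-inconsistent (difficult) cases."""
    verdicts, updates = [], 0
    for question, resp_a, resp_b in stream:
        if not is_self_consistent(meta_prompt, question, resp_a, resp_b):
            meta_prompt = refine_meta_prompt(meta_prompt, question)
            updates += 1
        verdicts.append(judge(meta_prompt, question, resp_a, resp_b))
    return verdicts, updates, meta_prompt
```

Because the meta-prompt only grows on inconsistent cases, easy samples cost a single consistency check, which is what makes the selective variant cheaper than updating on every sample.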