RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation

📅 2026-01-13
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Large language models (LLMs) often exhibit unreliable performance as automated evaluators due to prompt sensitivity, unverifiable reasoning, and misalignment with human rating scales. To address these limitations, this work proposes compiling natural language scoring rubrics into executable specifications, integrating structured decoding, deterministic evidence anchoring, and a lightweight Wasserstein-based post-calibration mechanism, all without updating model parameters. This approach yields stable, auditable evaluations that significantly improve agreement with human judgments on essay and summarization tasks, demonstrate strong robustness against adversarial perturbations, and enable smaller models to match or even surpass the evaluation performance of much larger counterparts.
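The "deterministic evidence anchoring" idea can be illustrated with a minimal check: after the judge emits a structured verdict, every quoted evidence span is verified verbatim against the source text, so the reasoning is auditable rather than free-form. This is a hypothetical sketch; the JSON shape (`score` plus an `evidence` list) and the function name are illustrative assumptions, not the paper's actual schema.

```python
def verify_evidence(source_text: str, judgment: dict) -> bool:
    """Deterministically check that every evidence span the judge cites
    appears verbatim in the source text being scored.

    `judgment` is a hypothetical structured output of the form
    {"score": int, "evidence": [str, ...]} used only for illustration.
    """
    return all(span in source_text for span in judgment.get("evidence", []))


source = "The essay argues that renewable energy lowers long-run costs."
judgment = {
    "score": 4,
    "evidence": ["renewable energy lowers long-run costs"],
}
print(verify_evidence(source, judgment))  # True: the quoted span is in the source
```

A verdict whose evidence cannot be located this way can be rejected or re-decoded, which is what makes the anchoring deterministic rather than a matter of trusting the model's chain of thought.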

๐Ÿ“ Abstract
The LLM-as-a-Judge paradigm promises scalable rubric-based evaluation, yet aligning frozen black-box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring), a compiler-executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein-based post-hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at https://github.com/LabRAI/Rulers.git.
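To give intuition for the "lightweight Wasserstein-based post-hoc calibration": in one dimension, the transport map that minimises the 1-Wasserstein distance between two empirical distributions is quantile matching, so raw judge scores can be remapped onto the human grading scale without touching model parameters. The sketch below is a generic quantile-matching calibrator under that assumption; the function name and data are illustrative, not the paper's implementation.

```python
import numpy as np


def quantile_calibrate(raw_scores, human_scores):
    """Map raw judge scores onto the human score distribution by 1-D
    quantile matching, which minimises the 1-Wasserstein distance between
    the two empirical distributions. Generic sketch, not RULERS itself.
    """
    raw = np.asarray(raw_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    # Empirical quantile of each raw score within the raw distribution.
    ranks = raw.argsort().argsort()
    quantiles = (ranks + 0.5) / len(raw)
    # Read the human score distribution off at the same quantiles.
    return np.quantile(human, quantiles)


judge_scores = [0.2, 0.9, 0.5, 0.7]   # raw judge outputs on an arbitrary scale
human_scores = [1, 2, 2, 3, 4, 4, 5]  # human grades on a 1-5 scale
print(quantile_calibrate(judge_scores, human_scores))
```

Because the map is monotone, calibration changes the scale of the scores but never their ranking, which is the property one wants from a post-hoc adjustment that leaves the judge frozen.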
Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-Judge
rubric alignment
evaluation robustness
evidence verification
scale calibration
Innovation

Methods, ideas, or system contributions that make the work stand out.

executable rubrics
evidence-anchored scoring
structured decoding
Wasserstein calibration
LLM-as-a-Judge