AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Human evaluation of open-domain AI applications, such as travel planning and clinical note generation, suffers from sparse feedback, high latency, and prohibitive cost. To address this, the authors propose AutoMetrics, a low-data automatic evaluation framework that jointly uses retrieval and LLM-as-a-judge evaluation: it retrieves relevant candidate metrics from MetricBank, a curated repository of 48 high-quality, domain-agnostic evaluation metrics, generates additional LLM-as-a-judge criteria, and fits a lightweight multi-metric regression model on minimal human feedback (fewer than 100 annotations), yielding an interpretable, human-aligned surrogate reward. Evaluated across five diverse open-domain tasks, AutoMetrics achieves up to 33.4% higher Kendall correlation with human judgments than standalone LLM-based evaluators, significantly outperforming existing automated evaluation approaches. The authors publicly release the full toolkit alongside MetricBank.

📝 Abstract
Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
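The core composition step described in the abstract, regressing a handful of candidate metric scores onto a small pool of human ratings and checking Kendall correlation, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the metric scores and human ratings are simulated, and plain least squares stands in for whatever regressor AutoMetrics actually uses.

```python
# Hypothetical sketch of AutoMetrics-style metric composition.
# All data below is simulated; the regression choice is an assumption.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

# Scores from k candidate metrics (retrieved + LLM-judge criteria) on n outputs.
n, k = 80, 4  # fewer than 100 feedback points, matching the paper's low-data regime
metric_scores = rng.uniform(0.0, 1.0, size=(n, k))

# Simulated human ratings: a hidden weighting of the metrics plus noise.
true_w = np.array([0.6, 0.1, 0.25, 0.05])
human = metric_scores @ true_w + rng.normal(0.0, 0.05, n)

# Compose the metrics via least-squares regression against human feedback.
X = np.column_stack([metric_scores, np.ones(n)])  # add an intercept column
w, *_ = np.linalg.lstsq(X, human, rcond=None)
surrogate = X @ w  # interpretable surrogate reward: weighted metric sum

# Compare rank agreement: best single metric alone vs. the learned composite.
tau_single, _ = kendalltau(metric_scores[:, 0], human)
tau_composed, _ = kendalltau(surrogate, human)
print(f"single-metric tau = {tau_single:.2f}, composed tau = {tau_composed:.2f}")
```

Because the composite is just a weighted sum of named metrics, the fitted weights double as an explanation of which criteria drive agreement with human raters.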
Problem

Research questions and friction points this paper is trying to address.

Generates automatic evaluators for AI applications
Improves correlation with human judgments using limited feedback
Provides interpretable metrics for open-ended domain evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically generates LLM-as-a-Judge evaluation criteria
Combines retrieval from curated MetricBank with regression composition
Synthesizes interpretable metrics from minimal human feedback points