🤖 AI Summary
Human evaluation of open-domain AI applications, such as travel planning and clinical note generation, suffers from sparse feedback, high latency, and prohibitive cost. To address this, the authors propose AutoMetrics, a low-data automatic evaluation framework that pairs retrieval with LLM-as-a-Judge modeling: it retrieves relevant candidate metrics from MetricBank, a curated repository of 48 high-quality, domain-agnostic evaluation metrics; generates additional LLM-as-a-Judge criteria; and fits a lightweight multi-metric regression model on minimal human feedback (fewer than 100 annotations), yielding an interpretable, human-aligned surrogate reward. Across five diverse open-domain tasks, AutoMetrics achieves up to 33.4% higher Kendall correlation with human judgments than standalone LLM-as-a-Judge evaluators, significantly outperforming existing automated evaluation approaches. The authors publicly release the full AutoMetrics toolkit alongside MetricBank.
📝 Abstract
Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal, taking you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can also serve as a proxy reward, matching the effect of a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
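The metric-composition step in the abstract, fitting a regression over per-metric scores so the combined score tracks human ratings, can be sketched as follows. This is a minimal illustration on synthetic data, not the actual AutoMetrics implementation: the metric scores, weights, and the 80-example feedback set are all made up, standing in for scores produced by retrieved MetricBank metrics and LLM-as-a-Judge criteria.

```python
import numpy as np

def fit_metric_weights(scores: np.ndarray, ratings: np.ndarray) -> np.ndarray:
    """Least-squares fit of linear weights (plus a bias term) that compose
    per-metric scores into a single surrogate reward aligned with ratings."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])  # append bias column
    w, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    return w

def predict(scores: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Apply the learned composition to a matrix of per-metric scores."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    return X @ w

def kendall_tau(a, b) -> float:
    """Naive O(n^2) Kendall rank correlation between two score vectors."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Synthetic demo: 80 feedback points (under the 100-annotation budget the
# paper targets) scored by 5 hypothetical metrics; human ratings are a
# noisy weighted mix of those metric scores.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=(80, 5))
true_w = np.array([0.6, 0.1, 0.0, 0.3, 0.0])
ratings = scores @ true_w + rng.normal(0.0, 0.05, size=80)

w = fit_metric_weights(scores, ratings)
preds = predict(scores, w)
tau = kendall_tau(preds, ratings)  # rank agreement of surrogate with humans
```

In the low-data regime a regularized regression (e.g., ridge) would typically replace plain least squares; the interpretable output is the same, a weight per metric indicating how much it contributes to the human-aligned score.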