🤖 AI Summary
This paper addresses the challenge of instance-level performance prediction for long-text generation tasks. The authors propose a task-, model-, and metric-agnostic black-box method that predicts multi-dimensional, fine-grained quality scores—such as factuality and coherence—together with prediction intervals, solely from input text and model output, thereby quantifying the uncertainty of each prediction. The key contributions are: (1) the first benchmark for instance-level performance prediction targeting multi-dimensional quality assessment; (2) a few-shot generalizable framework requiring only 16 annotated instances to achieve robust cross-task (11 tasks), cross-model (multiple LLMs), and cross-metric prediction; and (3) a unified architecture jointly modeling continuous-score regression and uncertainty estimation. Experiments demonstrate significant improvements over baselines. The authors release both an off-the-shelf tool and the open-source benchmark, advancing the practicality and trustworthiness of generative-model performance forecasting.
📝 Abstract
We motivate and share a new benchmark for instance-level performance prediction on long-form generation tasks with multi-faceted, fine-grained quality metrics. Our task-, model-, and metric-agnostic formulation predicts continuous evaluation-metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around those point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.
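To make the task formulation concrete, here is a minimal, hypothetical sketch of what "point estimate plus prediction interval" means in this setting. It is not the paper's actual method: the features, the least-squares regressor, and the residual-quantile interval are all illustrative assumptions standing in for whatever predictor and uncertainty model a real system would use.

```python
# Hedged sketch (NOT the paper's method): predict a continuous quality
# score and a prediction interval from a few-shot set of 16 examples,
# mirroring the benchmark's setup. All names/values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for features extracted from 16 annotated (input, output)
# pairs, e.g. lengths, overlap statistics, or embedding similarities.
X = rng.normal(size=(16, 3))
w_true = np.array([0.6, -0.2, 0.1])
# Toy stand-in for a continuous metric score (e.g. factuality).
y = X @ w_true + rng.normal(scale=0.1, size=16)

# Fit a least-squares regressor on the few-shot examples.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Calibrate an interval half-width from the absolute residuals
# (a split-conformal-style heuristic targeting ~90% coverage).
residuals = np.abs(y - X @ w)
half_width = np.quantile(residuals, 0.9)

# Predict a point estimate and interval for a new instance.
x_new = rng.normal(size=3)
point = x_new @ w
interval = (point - half_width, point + half_width)
```

The benchmark then evaluates both the accuracy of `point` against the true metric score and the calibration of `interval` (whether intervals of the stated coverage actually contain the true scores).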