Instance-level Performance Prediction for Long-form Generation Tasks

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of instance-level performance prediction for long-form generation tasks. We propose a task-, model-, and metric-agnostic black-box method that predicts multi-dimensional, fine-grained quality scores—such as factuality and coherence—as well as their prediction intervals, solely from input text and model output, thereby quantifying prediction uncertainty. Our key contributions are: (1) the first benchmark for instance-level performance prediction targeting multi-dimensional quality assessment; (2) a few-shot generalizable framework requiring only 16 annotated instances to achieve robust cross-task (11 tasks), cross-model (multiple LLMs), and cross-metric prediction; and (3) a unified architecture jointly modeling continuous-score regression and uncertainty estimation. Experiments demonstrate significant improvements over baselines. We release both an off-the-shelf tool and the open-source benchmark, advancing the practicality and trustworthiness of generative model performance forecasting.

📝 Abstract
We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multi-faceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.
Problem

Research questions and friction points this paper is trying to address.

Predicting continuous evaluation metric scores for long-form generation tasks
Inferring prediction intervals to quantify uncertainty in metric estimates
Developing task- and model-agnostic performance prediction using minimal training examples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Black-box performance prediction for generation tasks
Uncertainty quantification via prediction intervals
Few-shot learning with minimal training examples
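The setup the bullets above describe — a point estimate of a metric score plus a prediction interval, learned from as few as 16 labeled instances — can be illustrated with a toy nearest-neighbor regressor calibrated on leave-one-out residuals. This is a hypothetical stand-in for intuition only, not the paper's actual method; all function names and data below are invented:

```python
import math
import statistics

def knn_predict_with_interval(train, x, k=4, alpha=0.2):
    """Predict a metric score and a (1 - alpha) prediction interval for
    features x from a few labeled (features, score) pairs.
    Point estimate: mean score of the k nearest training neighbors.
    Interval: point estimate +/- a quantile of leave-one-out residuals
    (a split-conformal-style calibration on the training set)."""
    # point estimate from the k nearest neighbors of x
    neighbors = sorted(train, key=lambda fs: math.dist(fs[0], x))[:k]
    point = statistics.mean(s for _, s in neighbors)

    # calibrate: predict each training point from the remaining ones
    residuals = []
    for i, (xi, yi) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        nbrs = sorted(rest, key=lambda fs: math.dist(fs[0], xi))[:k]
        residuals.append(abs(yi - statistics.mean(s for _, s in nbrs)))
    residuals.sort()
    q = residuals[min(len(residuals) - 1,
                      math.ceil((1 - alpha) * len(residuals)) - 1)]
    return point, (point - q, point + q)

# 16 toy training instances: 2-d features -> quality score in [0, 1]
train = [((i / 15, (15 - i) / 15), 0.3 + 0.4 * i / 15) for i in range(16)]
score, (lo, hi) = knn_predict_with_interval(train, (0.5, 0.5))
```

In practice the features would be derived from the black-box model's input and output (e.g. embeddings), and the interval width reflects how uncertain the predictor is for that instance.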
Authors
Chi-Yang Hsu — The University of Texas at Austin, Austin, TX, USA
Alexander Braylan — The University of Texas at Austin, Austin, TX, USA
Yiheng Su — The University of Texas at Austin, Austin, TX, USA
Omar Alonso — Amazon (Information Retrieval, Evaluation, Labeling, Knowledge Graphs)
Matthew Lease — The University of Texas at Austin, Austin, TX, USA