🤖 AI Summary
This paper addresses the challenge of instance-level performance prediction for long-text generation tasks. The authors propose a task-, model-, and metric-agnostic black-box method that predicts multi-dimensional, fine-grained quality scores—such as factuality and coherence—together with prediction intervals, solely from input text and model output, thereby quantifying the uncertainty of each prediction. The key contributions are: (1) the first benchmark for instance-level performance prediction targeting multi-dimensional quality assessment; (2) a few-shot generalizable framework requiring only 16 annotated instances to achieve robust cross-task (11 tasks), cross-model (multiple LLMs), and cross-metric prediction; and (3) a unified architecture jointly modeling continuous-score regression and uncertainty estimation. Experiments demonstrate significant improvements over baselines. The authors release both an off-the-shelf tool and the open-source benchmark, advancing the practicality and trustworthiness of generative-model performance forecasting.
📝 Abstract
We motivate and share a new benchmark for instance-level performance prediction on long-form generation tasks with multi-faceted, fine-grained quality metrics. Our task-, model-, and metric-agnostic formulation predicts continuous evaluation-metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around those point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.
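To make the task formulation concrete, here is a minimal, hypothetical sketch of what "point estimate plus prediction interval" means in this setting. It is not the paper's actual method: the features, the least-squares regressor, and the residual-quantile interval are all illustrative assumptions standing in for whatever predictor and uncertainty model a real system would use.

```python
# Hedged sketch (NOT the paper's method): predict a continuous quality
# score and a prediction interval from a few-shot set of 16 examples,
# mirroring the benchmark's setup. All names/values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for features extracted from 16 annotated (input, output)
# pairs, e.g. lengths, overlap statistics, or embedding similarities.
X = rng.normal(size=(16, 3))
w_true = np.array([0.6, -0.2, 0.1])
# Toy stand-in for a continuous metric score (e.g. factuality).
y = X @ w_true + rng.normal(scale=0.1, size=16)

# Fit a least-squares regressor on the few-shot examples.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Calibrate an interval half-width from the absolute residuals
# (a split-conformal-style heuristic targeting ~90% coverage).
residuals = np.abs(y - X @ w)
half_width = np.quantile(residuals, 0.9)

# Predict a point estimate and interval for a new instance.
x_new = rng.normal(size=3)
point = x_new @ w
interval = (point - half_width, point + half_width)
```

The benchmark then evaluates both the accuracy of `point` against the true metric score and the calibration of `interval` (whether intervals of the stated coverage actually contain the true scores).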