Calibrating Model-Based Evaluation Metrics for Summarization

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses limitations of existing reference-free summarization evaluation methods: their scores are often poorly calibrated, and they depend on human annotations or large language models, so they do not reliably reflect summary quality. The authors propose a general framework that requires neither reference summaries nor human labels and produces proxy scores for both individual and average summary quality. Central to the approach is Group Isotonic Regression Binning (GIRB), a newly proposed calibration technique for continuous-valued tasks that adjusts raw predictions to better align with ground-truth evaluation metrics. Experiments across seven datasets show that the method consistently outperforms existing baselines, and it extends to discrete tasks such as question answering.

📝 Abstract

Recent advances in summary evaluation are based on model-based metrics to assess quality dimensions, such as completeness, conciseness, and faithfulness. However, these methods often require large language models, and predicted scores are frequently miscalibrated, limiting their reliability. Moreover, evaluating the average quality across different summaries for a single document typically requires access to multiple reference summaries. Here, we propose a general framework that generates individual and average proxy scores without relying on reference summaries, human annotations, or expensive model-based metrics. We also propose group isotonic regression binning (GIRB), a calibration method that adjusts the raw predictions to better align with ground-truth evaluation metrics. While we focus on continuous-value scenarios, such as summarization, the method is applicable to discrete-value tasks, such as question answering. Experiments on seven datasets demonstrate that our approach consistently outperforms existing baselines.
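
To make the calibration idea concrete, the sketch below fits plain isotonic regression with the pool-adjacent-violators algorithm, mapping raw predicted scores to a non-decreasing step function of ground-truth metric values. This is not the authors' GIRB method: the abstract does not say how groups or bins are formed, so that part is left out. The function names `fit_isotonic` and `calibrate` and the toy numbers are hypothetical, and the sketch assumes raw scores without ties.

```python
from bisect import bisect_left


def fit_isotonic(raw_scores, targets):
    """Fit a non-decreasing step function with pool-adjacent-violators.

    Illustrative only: ordinary isotonic regression, not the paper's GIRB.
    Returns (thresholds, values); block i covers raw scores up to thresholds[i].
    """
    pairs = sorted(zip(raw_scores, targets))
    # Each block holds [sum of targets, count, right-most raw score].
    blocks = []
    for x, y in pairs:
        blocks.append([y, 1, x])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, right = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(score, thresholds, values):
    """Map a raw score to the fitted value of the block that contains it."""
    i = bisect_left(thresholds, score)
    # Scores above the last threshold fall into the final block.
    return values[min(i, len(values) - 1)]


# Toy data (made up): raw proxy scores and the metric values they should track.
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
truth = [0.2, 0.1, 0.3, 0.6, 0.5, 0.9]
thresholds, values = fit_isotonic(raw, truth)
calibrated = [calibrate(s, thresholds, values) for s in raw]
```

On this toy data the two out-of-order pairs are pooled into shared blocks, so the calibrated scores never decrease as the raw score increases. A grouped or binned variant would presumably fit such a map per group or per bin, but that is a guess rather than something the abstract confirms.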

Problem

Research questions and friction points this paper is trying to address.

summary evaluation

model-based metrics

miscalibration

reference-free evaluation

calibration

Innovation

Methods, ideas, or system contributions that make the work stand out.

model-based evaluation

calibration

reference-free summarization