🤖 AI Summary
Existing large language models (LLMs) serve effectively as automatic evaluators in pairwise comparison settings but struggle to assign interpretable, absolute scores to individual NLG outputs (e.g., summaries, dialogues), limiting their utility in threshold-based applications. To address this, we propose a direct-scoring NLG evaluation framework that, at inference time, introduces synthetically generated summaries as virtual contrastive samples, thereby embedding pairwise ranking into the direct-scoring paradigm. Our method combines prompt engineering with synthetic data construction and requires no additional model training. Evaluated on the SummEval, TopicalChat, and HANNA benchmarks, it achieves axis-averaged sample-level Pearson correlations of 0.42, 0.38, and 0.45, respectively, comparable to state-of-the-art pairwise evaluators (+0.03/−0.03/+0.05). We publicly release our synthetic data to foster further research.
📝 Abstract
As large language models have been increasingly used as automatic raters for evaluating free-form content, including document summarization, dialog, and story generation, work has been dedicated to evaluating such models by measuring their correlations with human judgment. For *sample-level* performance, methods which operate by using pairwise comparisons between machine-generated texts perform well but often lack the ability to assign absolute scores to individual summaries, an ability crucial for use cases that require thresholding. In this work, we propose a direct-scoring method which uses synthetic summaries to act as pairwise machine rankings at test time. We show that our method performs comparably to state-of-the-art pairwise evaluators in terms of axis-averaged sample-level correlations on the SummEval (**+0.03**), TopicalChat (**-0.03**), and HANNA (**+0.05**) meta-evaluation benchmarks, and release the synthetic in-context summaries as data to facilitate future work.
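The core idea, deriving an absolute score from pairwise comparisons against synthetic reference summaries, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the `pairwise_prefers` stub stands in for a real LLM pairwise judge, and the anchor summaries and scoring rule (fraction of quality-ranked anchors the candidate beats) are illustrative assumptions.

```python
def pairwise_prefers(candidate: str, anchor: str) -> bool:
    """Stand-in for an LLM pairwise judge.

    A real system would prompt an LLM to pick the better summary;
    here we use a trivial length heuristic so the sketch is runnable.
    """
    return len(candidate) > len(anchor)

def direct_score(candidate: str, anchors: list[str]) -> float:
    """Map pairwise wins against synthetic anchors to an absolute score.

    `anchors` are synthetic summaries of graded quality (worst to best);
    the score is the fraction of anchors the candidate beats, so it lies
    in [0, 1] and can be compared against a fixed threshold.
    """
    wins = sum(pairwise_prefers(candidate, a) for a in anchors)
    return wins / len(anchors)

# Illustrative synthetic anchors, ordered worst to best.
anchors = ["bad.", "an okay summary here", "a fairly detailed, faithful summary"]
print(direct_score("a medium-length candidate", anchors))  # beats 2 of 3 anchors
```

Because every candidate is scored against the same fixed set of synthetic anchors, scores are comparable across candidates without running all pairwise comparisons between them.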