Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates a systematic bias in large language models (LLMs) when evaluating summaries: judges consistently prefer LLM-generated summaries over human-written ones, particularly when the human summaries exhibit low lexical overlap with the reference texts. The work correlates LLM evaluation bias with fine-grained overlap between judged summaries and human references, measured with ROUGE and BLEU. The authors systematically assess nine mainstream LLMs ranging from 1B to 12B parameters, including variants of Gemma 3 and LLaMA 3. With only one exception, all models disproportionately favor LLM-generated summaries in low-overlap scenarios and struggle to judge human summaries with minimal textual alignment to the references. These findings underscore the limitations of relying solely on LLM-based scoring for summary evaluation and raise serious concerns about the reliability of automated assessment in abstractive summarization.
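
To make the overlap measurement concrete, here is a minimal sketch of scoring a judged summary against a human reference with ROUGE-L and BLEU, the two metric families the study bins its judge preferences by. The specific ROUGE variant, smoothing choice, and whitespace tokenization below are assumptions, not the paper's exact configuration.

```python
# A minimal sketch of reference-overlap scoring, assuming ROUGE-L F1 and
# smoothed sentence-level BLEU (the paper's exact variants are not
# specified here).
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def overlap_scores(candidate: str, reference: str) -> dict:
    """Return lexical-overlap scores between a judged summary and a reference."""
    # ROUGE-L F-measure via Google's rouge_score package.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

    # Sentence-level BLEU with smoothing via NLTK; whitespace tokenization
    # is a simplifying assumption.
    smoothing = SmoothingFunction().method1
    bleu = sentence_bleu(
        [reference.split()], candidate.split(), smoothing_function=smoothing
    )
    return {"rougeL_f1": rouge_l, "bleu": bleu}


# Example: a highly abstractive (low-overlap) summary scores near zero on
# both metrics even though it preserves the meaning of the reference.
print(overlap_scores(
    "Stocks rallied after the central bank held rates steady.",
    "Equity markets climbed when policymakers left interest rates unchanged.",
))
```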

📝 Abstract
Large language model (LLM) judges are often used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, reason more effectively, and are more robust to paraphrasing. However, LLM judges exhibit biases for length and order, among others, and are vulnerable to various adversarial input prompts. While recent studies have examined these biases, few have analyzed them at a granular level against a well-defined overlap metric. In this work we provide an LLM judge bias analysis as a function of overlap with human-written responses in the summarization domain. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarity between the judged summaries (as measured by ROUGE and BLEU) decreases; this pattern holds for all but one model tested and persists regardless of the models' own position biases. Additionally, we find that models struggle to judge even summaries with limited overlap, suggesting that LLM-as-a-judge in the summarization domain should rely on techniques beyond simple comparison.

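As a hedged illustration of the pairwise judging setup the abstract describes, the sketch below queries the judge twice with the two summaries in both orders, which separates a genuine preference from an order artifact. The prompt wording and the `query_llm` helper are hypothetical stand-ins; the paper's actual prompt template and inference setup are not specified here.

```python
# A minimal sketch of pairwise LLM-as-a-judge with a position swap to
# control for order bias. `query_llm` is a hypothetical helper standing in
# for whatever inference API serves the judge model; the prompt wording is
# an assumption, not the paper's template.

JUDGE_PROMPT = (
    "You are judging two summaries of the same article.\n\n"
    "Article:\n{article}\n\n"
    "Summary A:\n{a}\n\n"
    "Summary B:\n{b}\n\n"
    "Which summary is better? Answer with exactly 'A' or 'B'."
)


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for the judge model's inference call."""
    raise NotImplementedError("wire this to your LLM inference backend")


def judge_pair(article: str, llm_summary: str, human_summary: str) -> str:
    """Judge the pair in both orders; return 'llm', 'human', or 'inconsistent'."""
    first = query_llm(
        JUDGE_PROMPT.format(article=article, a=llm_summary, b=human_summary)
    ).strip()
    second = query_llm(
        JUDGE_PROMPT.format(article=article, a=human_summary, b=llm_summary)
    ).strip()

    # Agreement across both orders means the verdict is not a position artifact.
    if first == "A" and second == "B":
        return "llm"
    if first == "B" and second == "A":
        return "human"
    return "inconsistent"  # the judge flipped with position, i.e. order bias
```
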
Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-judge
overlap bias
summary evaluation
human-written summaries
evaluation reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-judge
overlap bias
summary evaluation
human-written summaries
adversarial robustness