🤖 AI Summary
Assessing the factual consistency of hotel highlight summaries generated by large language models (LLMs) in real-world settings remains challenging, as it is unclear which automatic metrics are reliable in open-domain, high-stakes commercial applications. Method: We construct a human-annotated benchmark and systematically compare lexical-overlap metrics, trainable models, and LLM-based evaluators. Contribution/Results: Simple n-gram overlap metrics—particularly ROUGE-L—correlate well with human judgments (Spearman ρ = 0.63), outperforming more complex approaches. In contrast, LLM-based evaluators suffer from prompt-induced bias, undermining annotation reliability. We propose a span-level error taxonomy, identifying factual errors and unverifiable statements as the highest commercial-risk error types. Our findings demonstrate that lightweight, lexically grounded fidelity assessment is not only feasible but also more robust and practically deployable than sophisticated alternatives.
📝 Abstract
We examine the evaluation of faithfulness to input data in the context of hotel highlights: brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics such as word overlap correlate surprisingly well with human judgments (Spearman rank correlation of 0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation, as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows that incorrect and non-checkable information poses the greatest risks. We also highlight challenges in crowdsourced evaluations.
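The lexical-overlap metric the paper highlights, ROUGE-L, scores a summary by the longest common subsequence (LCS) it shares with the source text. A minimal sketch of an LCS-based ROUGE-L F-score is below; the example strings are hypothetical illustrations, not drawn from the paper's data:

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(summary, source):
    # Token-level LCS precision/recall, combined into an F-score.
    s, r = summary.lower().split(), source.lower().split()
    lcs = lcs_len(s, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(s), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# Hypothetical highlight vs. source description.
score = rouge_l("rooftop pool with city views",
                "the hotel features a rooftop pool offering city views")
```

In the setup the paper describes, per-example scores like this would then be correlated with human judgments (e.g. via Spearman rank correlation) to assess the metric's reliability.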