🤖 AI Summary
Assessing the factual consistency of hotel highlight summaries generated by large language models (LLMs) in real-world settings remains challenging, as it is unclear which automatic metrics are reliable in open-domain, high-stakes commercial applications. Method: We construct a human-annotated benchmark and systematically compare lexical-overlap metrics, trainable models, and LLM-based evaluators. Contribution/Results: Simple n-gram overlap metrics—particularly ROUGE-L—correlate well with human judgments (Spearman ρ = 0.63), outperforming more complex approaches. In contrast, LLM-based evaluators suffer from prompt-induced bias, undermining annotation reliability. We propose a span-level error taxonomy, identifying factual errors and unverifiable statements as the highest commercial-risk error types. Our findings demonstrate that lightweight, lexically grounded fidelity assessment is not only feasible but also more robust and practically deployable than sophisticated alternatives.
📝 Abstract
We examine the evaluation of faithfulness to input data in the context of hotel highlights: brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics such as word overlap correlate surprisingly well with human judgments (Spearman rank correlation of 0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation, as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows that incorrect and non-checkable information poses the greatest risks. We also highlight challenges in crowdsourced evaluations.
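The lexical-overlap metric the paper highlights, ROUGE-L, scores a summary by the longest common subsequence (LCS) it shares with the source text. A minimal sketch of an LCS-based ROUGE-L F-score is below; the example strings are hypothetical illustrations, not drawn from the paper's data:

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(summary, source):
    # Token-level LCS precision/recall, combined into an F-score.
    s, r = summary.lower().split(), source.lower().split()
    lcs = lcs_len(s, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(s), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# Hypothetical highlight vs. source description.
score = rouge_l("rooftop pool with city views",
                "the hotel features a rooftop pool offering city views")
```

In the setup the paper describes, per-example scores like this would then be correlated with human judgments (e.g. via Spearman rank correlation) to assess the metric's reliability.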