AI Summary
This paper addresses the challenge of evaluating faithfulness in abstractive summarization of narrative text by introducing StorySumm, a benchmark designed to test whether evaluation methods can detect subtle, easily missed inconsistencies in LLM-generated summaries. Methodologically, it constructs a dataset of short stories paired with LLM-generated summaries, annotated with localized faithfulness labels and explanations that tie each error to a specific part of the summary. Key contributions are threefold: (1) it exposes systematic blind spots in single-protocol human evaluation and advocates combining multiple methods when establishing ground-truth labels; (2) it provides a faithfulness evaluation resource for narrative summarization with localized error labels and explanations; and (3) empirical results show that no recent automatic metric exceeds 70% balanced accuracy, confirming StorySumm's value as a challenging new benchmark.
Abstract
Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, StorySumm, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.
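To make the headline figure concrete, below is a minimal sketch of how balanced accuracy would be computed for a metric's binary faithfulness verdicts (1 = faithful, 0 = unfaithful). The labels shown are hypothetical placeholders, not StorySumm data, and the function is a generic illustration rather than the paper's evaluation code.

```python
# Sketch: balanced accuracy for binary faithfulness verdicts.
# Hypothetical labels only; not taken from the StorySumm benchmark.

def balanced_accuracy(gold: list[int], pred: list[int]) -> float:
    """Mean of per-class recall on faithful (1) and unfaithful (0) summaries."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    sensitivity = tp / (tp + fn)   # recall on faithful summaries
    specificity = tn / (tn + fp)   # recall on unfaithful summaries
    return (sensitivity + specificity) / 2

# Toy example: a metric that calls almost everything "faithful" looks good
# on plain accuracy (5/6) but is penalized by balanced accuracy.
gold = [1, 1, 1, 1, 0, 0]
pred = [1, 1, 1, 1, 1, 0]
print(balanced_accuracy(gold, pred))  # 0.75
```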