STORYSUMM: Evaluating Faithfulness in Story Summarization

πŸ“… 2024-07-09
πŸ›οΈ Conference on Empirical Methods in Natural Language Processing
πŸ“ˆ Citations: 3
✨ Influential: 0
πŸ€– AI Summary
This paper addresses the challenge of evaluating faithfulness in narrative summarization by introducing StorySumm, a benchmark of LLM-generated summaries of short stories annotated with localized faithfulness labels and error explanations. Methodologically, the benchmark is designed to test evaluation methods themselves: it checks whether a given human annotation protocol or automatic metric can detect inconsistencies that are easy to miss but obvious once pointed out. Key contributions are threefold: (1) it shows that any single human annotation protocol is likely to miss inconsistencies and advocates combining a range of methods when establishing ground truth for a summarization dataset; (2) it provides a faithfulness evaluation resource with localized labels and accompanying error explanations; and (3) empirical results show that no recent automatic metric exceeds 70% balanced accuracy, establishing StorySumm as a challenging benchmark for future work in faithfulness evaluation.
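
To make the label format concrete, here is a minimal sketch of what one benchmark record could look like, assuming a simple per-sentence representation; the field names and example values are illustrative assumptions and do not reflect the dataset's actual schema or files.

```python
# Hypothetical STORYSUMM-style record. Field names and values are
# illustrative assumptions, not the dataset's actual schema.
record = {
    "story": "A short story of a few hundred to a few thousand words ...",
    "summary_sentences": [
        "The narrator finds an old letter in the attic.",
        "She burns it without ever reading it.",  # subtly unfaithful
    ],
    "labels": [1, 0],  # localized labels: 1 = faithful, 0 = unfaithful
    "explanations": [
        None,
        "The story says she reads the letter twice before putting it away.",
    ],
}

# An evaluation method is judged on whether it flags the sentences that
# annotators marked as unfaithful.
predicted = [1, 1]  # a method that misses the subtle inconsistency
missed = [i for i, (p, g) in enumerate(zip(predicted, record["labels"])) if p != g]
print("missed inconsistencies at sentence indices:", missed)
```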

πŸ“ Abstract
Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, StorySumm, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.
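
For context on the headline number, balanced accuracy is the average of per-class recall, so a metric cannot score well simply by labeling every summary faithful. The sketch below computes it for hypothetical binary labels; it is not taken from the paper's evaluation code.

```python
def balanced_accuracy(gold, pred):
    """Mean of per-class recall over the unfaithful (0) and faithful (1) classes."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, g in enumerate(gold) if g == cls]
        correct = sum(1 for i in idx if pred[i] == cls)
        recalls.append(correct / len(idx) if idx else 0.0)
    return sum(recalls) / len(recalls)

# Hypothetical labels: a metric that calls nearly every summary faithful
# looks accurate overall but scores poorly on balanced accuracy.
gold = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 1, 1, 1, 0, 1, 1, 1]
print(f"balanced accuracy = {balanced_accuracy(gold, pred):.2f}")  # ~0.67
```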
Problem

Research questions and friction points this paper is trying to address.

Evaluating faithfulness in abstractive story summarization
Detecting challenging inconsistencies in narrative summaries
Assessing how accurately automatic metrics evaluate faithfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces STORYSUMM dataset for faithfulness evaluation
Uses localized labels and error explanations
Shows that recent automatic metrics achieve no more than 70% balanced accuracy on the benchmark