🤖 AI Summary
This study addresses the challenge of measuring event information coverage when evaluating automatic news summarization. The authors propose an evaluation method based on event overlap, departing from conventional paradigms that rely on lexical or sentence-level overlap and similarity scores. Instead, the approach treats the event as the fundamental unit of evaluation: structured events are extracted from generated summaries, reference summaries, and source articles, then matched semantically, with human-annotated event labels serving as a gold standard. Experiments on a richly annotated Norwegian news dataset show that the method yields finer-grained, more interpretable insight into how well summaries preserve the events reported in the source articles. The core contribution is the adoption of events as the central semantic unit of summarization quality assessment, moving evaluation from shallow surface matching toward deeper semantic coverage.
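To make the matching step concrete, here is a minimal sketch of one plausible way to match extracted events semantically. The paper does not specify its matcher; the embedding model (`all-MiniLM-L6-v2`), the similarity threshold, the plain-string event representation, and the greedy one-to-one strategy are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: greedy one-to-one semantic matching of extracted
# events, using cosine similarity of sentence embeddings as the criterion.
# Model, threshold, and event representation are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def match_events(candidate_events, reference_events, threshold=0.7):
    """Return (i, j) pairs of candidate/reference events whose embedding
    similarity exceeds the threshold, matched greedily one-to-one."""
    if not candidate_events or not reference_events:
        return []
    cand_emb = model.encode(candidate_events, convert_to_tensor=True)
    ref_emb = model.encode(reference_events, convert_to_tensor=True)
    sim = util.cos_sim(cand_emb, ref_emb)  # shape: (n_cand, n_ref)

    # Greedy matching: repeatedly take the highest-similarity unused pair.
    pairs, used_cand, used_ref = [], set(), set()
    scored = [
        (sim[i][j].item(), i, j)
        for i in range(len(candidate_events))
        for j in range(len(reference_events))
    ]
    for score, i, j in sorted(scored, reverse=True):
        if score < threshold:
            break  # all remaining pairs are below the threshold
        if i in used_cand or j in used_ref:
            continue
        pairs.append((i, j))
        used_cand.add(i)
        used_ref.add(j)
    return pairs
```

A greedy matcher is the simplest choice here; an optimal assignment (e.g. Hungarian matching) would be a natural alternative if one-to-one quality matters.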
📝 Abstract
An abstractive summary of a news article contains its most important information in condensed form. The evaluation of summaries automatically generated by language models relies heavily on human-authored summaries as gold references, against which overlapping units or similarity scores are computed. News articles report events, and ideally so should their summaries. In this work, we propose to evaluate the quality of abstractive summaries by calculating overlapping events between generated summaries, reference summaries, and the original news articles. We experiment on a richly annotated Norwegian dataset comprising both event annotations and summaries authored by expert human annotators. Our approach provides more insight into the event information contained in the summaries.
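Given an event matcher like the sketch above, the overlap calculation itself reduces to set-style scores over matched events. The function below is a hypothetical illustration of such scores against both the reference summary and the source article; the metric names and the reuse of `match_events` are assumptions for this sketch, not the paper's defined measures.

```python
def event_overlap_scores(generated, reference, source, match=match_events):
    """Event-level precision/recall of a generated summary against the
    reference, plus its coverage of events reported in the source article."""
    ref_matches = match(generated, reference)
    src_matches = match(generated, source)
    precision = len(ref_matches) / len(generated) if generated else 0.0
    recall = len(ref_matches) / len(reference) if reference else 0.0
    coverage = len(src_matches) / len(source) if source else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "f1": f1, "source_coverage": coverage}

# Toy usage with invented event strings (purely illustrative):
gen = ["parliament passed the climate bill", "protests erupted in Oslo"]
ref = ["the climate bill was approved by parliament"]
src = ["the climate bill was approved by parliament",
       "demonstrations took place in Oslo",
       "the prime minister commented on the vote"]
print(event_overlap_scores(gen, ref, src))
```

Scoring against the source article as well as the reference is what distinguishes this setup from reference-only metrics: it can flag events a summary covers that the reference omits, and vice versa.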