🤖 AI Summary
Existing automatic summarization metrics struggle to assess the logical faithfulness of parliamentary debate summaries, which hinders public understanding of policy reasoning. This work proposes a formal evaluation framework grounded in computational argumentation that brings the alignment of argument structures to this task. By anchoring on contested proposals, the framework examines whether summaries generated by large language models accurately preserve the reasoning offered to support or oppose a policy. Moving beyond surface-level semantic similarity, the approach emphasizes the faithful preservation of reasoning and is demonstrated in a case study of European Parliament debates. It offers an interpretable, structure-aware approach to evaluating faithfulness in summaries of political text.
📝 Abstract
Understanding how policy is debated and justified in parliament is a fundamental aspect of the democratic process. However, the volume and complexity of such debates mean that outside audiences struggle to engage. Meanwhile, Large Language Models (LLMs) have been shown to enable automated summarisation at scale. While summaries of debates can make parliamentary procedures more accessible, evaluating whether these summaries faithfully communicate argumentative content remains challenging. Existing automated summarisation metrics have been shown to correlate poorly with human judgements of consistency (i.e., faithfulness or alignment between summary and source). In this work, we propose a formal framework for evaluating parliamentary debate summaries that grounds argument structures in the contested proposals up for debate. Our novel approach, driven by computational argumentation, focuses the evaluation on formal properties concerning the faithful preservation of the reasoning presented to justify or oppose policy outcomes. We demonstrate our methods using a case study of debates from the European Parliament and associated LLM-driven summaries.