🤖 AI Summary
To address the limited LLM integration and weak evaluation infrastructure in argument summarization (ArgSum), this paper proposes the first LLM-native end-to-end framework for the task. On the generation side, it introduces a task-aware prompting strategy to improve summary relevance, faithfulness, and structural coherence. On the evaluation side, it develops an interpretable, reproducible prompt-based automatic evaluation scheme and constructs the first human-annotated benchmark dataset built specifically for ArgSum, featuring multi-dimensional quality annotations (e.g., factual consistency, argument coverage, logical flow). Experiments show that the proposed approach achieves state-of-the-art performance in both generation and evaluation, significantly outperforming traditional supervised models. By unifying generation and assessment within an LLM-centric architecture, this work moves ArgSum from feature-engineering paradigms toward LLM-driven methodologies.
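To make the evaluation side concrete, the sketch below shows one way a prompt-based, multi-dimensional ArgSum judge could be implemented. It is an illustration only, not the paper's exact protocol: the OpenAI client, the `gpt-4o` model name, the rubric wording, and the 1-5 scoring format are all assumptions standing in for whatever prompts and models the authors actually used.

```python
# Minimal sketch of a prompt-based ArgSum evaluation (LLM-as-judge).
# Assumptions: the OpenAI chat API and "gpt-4o" are illustrative stand-ins;
# the paper's actual prompts, model, and rubric may differ.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DIMENSIONS = ["factual consistency", "argument coverage", "logical flow"]

PROMPT_TEMPLATE = """You are evaluating a summary of a set of arguments.

Arguments:
{arguments}

Summary:
{summary}

For each dimension below, give an integer score from 1 (poor) to 5 (excellent)
and a one-sentence justification. Respond with JSON only, e.g.
{{"factual consistency": {{"score": 4, "reason": "..."}}}}

Dimensions: {dimensions}
"""

def evaluate_summary(arguments: list[str], summary: str) -> dict:
    """Score one candidate summary along multiple quality dimensions."""
    prompt = PROMPT_TEMPLATE.format(
        arguments="\n".join(f"- {a}" for a in arguments),
        summary=summary,
        dimensions=", ".join(DIMENSIONS),
    )
    response = client.chat.completions.create(
        model="gpt-4o",                              # illustrative model choice
        temperature=0,                               # deterministic scoring aids reproducibility
        response_format={"type": "json_object"},     # force parseable JSON output
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    args = ["School uniforms reduce peer pressure.",
            "Uniforms suppress students' self-expression."]
    print(evaluate_summary(args, "Uniforms curb peer pressure but limit expression."))
```

Keeping the rubric explicit in the prompt and requesting per-dimension justifications is what makes this style of evaluation interpretable; fixing the temperature and output schema is what makes scores comparable across runs.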
📝 Abstract
Large Language Models (LLMs) have revolutionized various Natural Language Generation (NLG) tasks, including Argument Summarization (ArgSum), a key subfield of Argument Mining (AM). This paper investigates the integration of state-of-the-art LLMs into ArgSum, including its evaluation. In particular, we propose a novel prompt-based evaluation scheme and validate it through a new human benchmark dataset. Our work makes three main contributions: (i) the integration of LLMs into existing ArgSum frameworks, (ii) the development of a new LLM-based ArgSum system, benchmarked against prior methods, and (iii) the introduction of an advanced LLM-based evaluation scheme. We demonstrate that the use of LLMs substantially improves both the generation and evaluation of argument summaries, achieving state-of-the-art results and advancing the field of ArgSum.