🤖 AI Summary
Dialogue summarization research suffers from ambiguous task definitions, a fragmented understanding of its challenges, and inconsistent evaluation practices. Method: We systematically analyze 1,262 English-language dialogue summarization papers published between 2019 and 2024, using Semantic Scholar/DBLP retrieval, cross-paper thematic coding, and technique–challenge alignment analysis. Contribution/Results: We propose a unified challenge taxonomy spanning six dimensions: language quality, structural coherence, comprehension, speaker modeling, salience, and factual consistency. Our analysis shows that ROUGE remains the most commonly used metric despite its limitations, and that human evaluation is frequently reported without sufficient detail on inter-annotator agreement and annotation guidelines. We identify SAMSum, AMI, and DialogSum as the three dominant benchmark datasets and find that comprehension, factual consistency, and salience modeling remain persistent bottlenecks. We also argue that this challenge taxonomy remains relevant in the large language model era, providing a foundation for evaluation standardization and a challenge-driven research paradigm.
📝 Abstract
Abstractive dialogue summarization is the task of distilling conversations into informative and concise summaries. Although focused reviews have been conducted on this topic, there is a lack of comprehensive work that details the core challenges of dialogue summarization, unifies the differing understandings of the task, and aligns proposed techniques, datasets, and evaluation metrics with those challenges. This article summarizes the research on Transformer-based abstractive summarization for English dialogues by systematically reviewing 1,262 unique research papers published between 2019 and 2024, relying on the Semantic Scholar and DBLP databases. We cover the main challenges present in dialogue summarization (i.e., language, structure, comprehension, speaker, salience, and factuality) and link them to corresponding techniques such as graph-based approaches, additional training tasks, and planning strategies, which typically rely heavily on BART-based encoder-decoder models. Recent advances in training methods have led to substantial improvements in language-related challenges. However, challenges such as comprehension, factuality, and salience remain difficult and present significant research opportunities. We further investigate how these approaches are typically analyzed, covering the datasets for the subdomains of dialogue (e.g., meeting, customer service, and medical), the established automatic metrics (e.g., ROUGE), and common human evaluation approaches for assigning scores and evaluating annotator agreement. We observe that only a few datasets (i.e., SAMSum, AMI, DialogSum) are widely used. Despite its limitations, the ROUGE metric is the most commonly used, while human evaluation, considered the gold standard, is frequently reported without sufficient detail on inter-annotator agreement and annotation guidelines. Additionally, we discuss the possible implications of recently explored large language models and conclude that our challenge taxonomy remains relevant, although the relative importance and difficulty of individual challenges may shift.