🤖 AI Summary
This work addresses a critical gap in automatic dialogue summarization evaluation: the neglect of dialogue-specific structural transitions and shifts in narrative perspective, coupled with the absence of a fine-grained, hierarchical error analysis framework. To this end, the authors propose a two-tier error taxonomy for dialogue summarization, distinguishing between dialogue-level and within-turn errors, and systematically annotate and analyze issues such as information omission, perspective bias, and hallucination. Leveraging a high-quality human-annotated dataset, the study empirically uncovers recurrent error patterns, including the frequent omission of mid-dialogue content and the prevalence of extrinsic hallucinations toward the summary's end. Experimental results demonstrate the robustness of the proposed taxonomy and reveal that current large language models still struggle to detect these nuanced, fine-grained errors.
📝 Abstract
Dialogues are a predominant mode of communication for humans, and it is immensely helpful to have automatically generated summaries of them (e.g., to revisit key points discussed in a meeting, or to review conversations between customer agents and product users). Prior works on dialogue summary evaluation largely ignore the complexities specific to this task: (i) the shift in structure, from multiple speakers discussing information in a scattered fashion across several turns to a summary's sentences, and (ii) the shift in narration viewpoint, from the speakers' first/second-person narration to standardized third-person narration in the summary. In this work, we introduce our framework DIAL-SUMMER to address the above. We propose DIAL-SUMMER's taxonomy of errors to comprehensively evaluate dialogue summaries at two hierarchical levels: DIALOGUE-LEVEL, which focuses on the broader speakers/turns, and WITHIN-TURN-LEVEL, which focuses on the information talked about inside a turn. We then present DIAL-SUMMER's dataset composed of dialogue summaries manually annotated with our taxonomy's fine-grained errors. We conduct empirical analyses of these annotated errors and observe interesting trends (e.g., turns occurring in the middle of the dialogue are the most frequently missed in the summary, and extrinsic hallucinations largely occur at the end of the summary). We also conduct experiments on LLM-Judges' capability at detecting these errors, through which we demonstrate the challenging nature of our dataset, the robustness of our taxonomy, and the need for future work to enhance LLMs' performance on this task. Code and inference dataset coming soon.