AI Summary
Automatic evaluation of multimodal dialogue summarization (MDS) currently lacks a reliable human-annotated meta-evaluation benchmark. Method: This paper introduces MDSEval, the first meta-evaluation benchmark for MDS, encompassing image-sharing dialogues, human-written summaries, and fine-grained quality annotations across eight dimensions. It is the first work to define MDS-specific evaluation dimensions, and it proposes the Mutually Exclusive Key Information (MEKI) framework, a novel cross-modal data filtering approach that improves inter-modal alignment and annotation consistency. Contribution/Results: A systematic evaluation of state-of-the-art multimodal evaluators reveals that they exhibit significant bias and insufficient discriminative power when assessing summaries generated by advanced multimodal large language models (MLLMs). MDSEval establishes a standardized benchmark for MDS evaluation, uncovers critical limitations in current methodologies, and advances trustworthy multimodal evaluation research.
Abstract
Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art multimodal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.
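Since the abstract only names the MEKI filtering idea, a minimal sketch may help make it concrete. The snippet below illustrates one plausible reading: score, for each modality, how much of its embedded key information is not recoverable from the other modality, and keep only dialogues where both modalities contribute exclusive information. The encoders, the residual-projection score, and the 0.3 threshold are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative stand-ins for real multimodal encoders (e.g., a CLIP-style
# model mapping text and images into a shared embedding space). These are
# assumptions for this sketch, not part of the MDSEval release.
def embed_text(text: str, dim: int = 512) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def embed_image(image_id: str, dim: int = 512) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def exclusive_info(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of a's embedding not explained by b: the norm of the
    residual after projecting a onto b, normalized to [0, 1]."""
    b_hat = b / np.linalg.norm(b)
    residual = a - np.dot(a, b_hat) * b_hat
    return float(np.linalg.norm(residual) / np.linalg.norm(a))

def keep_sample(dialogue_text: str, image_id: str,
                threshold: float = 0.3) -> bool:
    """Keep a dialogue only if each modality carries key information the
    other does not, i.e., both exclusivity scores are high."""
    t, v = embed_text(dialogue_text), embed_image(image_id)
    meki_text = exclusive_info(t, v)   # info in the text absent from the image
    meki_image = exclusive_info(v, t)  # info in the image absent from the text
    return min(meki_text, meki_image) >= threshold

# Example: filter a toy corpus of (dialogue, image) pairs.
corpus = [("Look at this cake I baked!", "img_001"),
          ("Running late, see you at 7.", "img_002")]
filtered = [pair for pair in corpus if keep_sample(*pair)]
print(f"kept {len(filtered)} of {len(corpus)} samples")
```

In the benchmark itself, such a filter would presumably operate on extracted key information rather than raw embeddings; the sketch only shows the shape of the decision rule, namely that a sample survives only when neither modality is redundant given the other.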