MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

📅 2025-10-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing automatic evaluation methods for multimodal dialogue summarization (MDS) lack a reliable, human-annotated meta-evaluation benchmark. Method: This paper introduces MDSEval, the first meta-evaluation benchmark for MDS, comprising image-sharing dialogues, human-written summaries, and fine-grained human quality annotations across eight well-defined aspects. It is the first work to identify and formalize MDS-specific evaluation dimensions, and it proposes a filtering framework based on Mutually Exclusive Key Information (MEKI), which selects dialogues in which each modality contributes key information absent from the other, ensuring data quality and richness. Contribution/Results: A systematic benchmark of state-of-the-art multimodal evaluation methods reveals significant biases and insufficient discriminative power when assessing summaries generated by advanced multimodal large language models (MLLMs). MDSEval thus provides a standardized foundation for MDS evaluation and exposes critical limitations of current methods.
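The meta-evaluation protocol implied here can be made concrete: an automatic evaluator is judged by how well its scores agree with human judgments on each quality aspect. Below is a minimal sketch of that idea, assuming per-summary score lists per aspect and Spearman correlation as the agreement measure; the aspect names, data layout, and function names are illustrative assumptions, not the paper's implementation.

```python
# Minimal meta-evaluation sketch: correlate an automatic evaluator's scores
# with human annotations per quality aspect. Aspect keys and data layout are
# assumptions for illustration, not MDSEval's actual schema.
from scipy.stats import spearmanr

def meta_evaluate(auto_scores: dict[str, list[float]],
                  human_scores: dict[str, list[float]]) -> dict[str, float]:
    """Return the Spearman correlation between automatic and human scores per aspect.

    Both dicts map an aspect name to one score per candidate summary, in the
    same order. Higher correlation means a more human-aligned evaluator.
    """
    results = {}
    for aspect, human in human_scores.items():
        rho, _pvalue = spearmanr(auto_scores[aspect], human)
        results[aspect] = rho
    return results

# Example: an evaluator that tracks humans well on one aspect, poorly on another.
auto = {"coherence": [4.0, 3.0, 5.0, 2.0], "faithfulness": [3.0, 3.5, 3.0, 3.5]}
human = {"coherence": [4.0, 2.0, 5.0, 1.0], "faithfulness": [5.0, 1.0, 4.0, 2.0]}
print(meta_evaluate(auto, human))
```

Per-aspect correlations like these are also what surface the failure mode the summary describes: an evaluator whose scores barely separate strong MLLM-generated summaries yields near-zero correlation with human rankings.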

πŸ“ Abstract
Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art multimodal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.
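To make the MEKI filtering idea tangible: a dialogue is worth keeping only if each modality carries key information the other lacks, so that a faithful summary must draw on both. The sketch below assumes key facts have already been extracted from the text and the images and embedded as vectors; the helper names, thresholds, and coverage criterion are illustrative assumptions, not the paper's formulation.

```python
# Hypothetical sketch of MEKI-style cross-modal filtering. The extraction and
# embedding of key facts is assumed to happen upstream; thresholds (0.7, tau)
# are illustrative, not values from the paper.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def meki_score(facts_a: list[np.ndarray], facts_b: list[np.ndarray]) -> float:
    """Fraction of key facts in modality A not covered by any fact in modality B.

    A fact counts as exclusive to A when its best cosine match in the other
    modality falls below a coverage threshold (0.7 here, an assumption).
    """
    if not facts_a:
        return 0.0
    exclusive = sum(
        1 for fa in facts_a
        if max((cosine(fa, fb) for fb in facts_b), default=0.0) < 0.7
    )
    return exclusive / len(facts_a)

def keep_sample(text_facts: list[np.ndarray],
                image_facts: list[np.ndarray],
                tau: float = 0.3) -> bool:
    # Keep a dialogue only if *both* modalities carry enough exclusive key
    # information, so a good summary cannot ignore either one.
    return (meki_score(image_facts, text_facts) >= tau and
            meki_score(text_facts, image_facts) >= tau)
```

Filtering on mutual exclusivity in both directions, rather than on either modality alone, is what keeps the benchmark genuinely multimodal: dialogues whose images merely restate the text would pass a one-sided check but fail this one.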
Problem

Research questions and friction points this paper is trying to address.

Establishing the first meta-evaluation benchmark for multimodal dialogue summarization
Proposing a filtering framework to ensure high-quality multimodal data
Evaluating state-of-the-art methods' limitations and biases in MDS assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes MEKI filtering framework for data quality
Introduces first meta-evaluation benchmark for MDS
Identifies and formalizes key MDS evaluation dimensions