Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

📅 2025-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal large language models (MLLMs) are trained primarily on single-turn visual question answering and struggle to model realistic multi-turn multimodal dialogues. Method: The authors introduce MMDiag, a multi-turn multimodal dialogue benchmark featuring strong correlations between questions, between questions and images, and among image regions, constructed through deliberately designed rules with GPT assistance. On top of it, they propose DiagNote, an MLLM with two interacting modules, Deliberate and Gaze, which perform Chain-of-Thought reasoning and visual annotation respectively throughout multi-turn dialogues, enabling fine-grained cross-modal alignment. Results: On MMDiag, DiagNote outperforms mainstream MLLMs in multi-turn visual grounding accuracy and cross-turn consistency, with a reported average improvement of 23.6% on joint reasoning and region-level grounding tasks, indicating that dynamic visual annotation helps sustain dialogue coherence.

📝 Abstract
Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown great capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a multi-turn multimodal dialogue dataset. This dataset is collaboratively generated through deliberately designed rules and GPT assistance, featuring strong correlations between questions, between questions and images, and among different image regions; thus aligning more closely with real-world scenarios. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and brings more challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote consists of two modules (Deliberate and Gaze) interacting with each other to perform Chain-of-Thought and annotations respectively, throughout multi-turn dialogues. We empirically demonstrate the advantages of DiagNote in both grounding and jointly processing and reasoning with vision and language information over existing MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Most existing MLLMs are trained on single-turn visual question answering, which does not reflect real-world human conversation
Multi-turn dialogue demands cross-turn visual grounding and joint vision-language reasoning that current models handle poorly
Existing datasets lack the strong correlations between questions, images, and image regions needed to benchmark these capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

MMDiag: a multi-turn multimodal dialogue dataset generated through designed rules with GPT assistance
DiagNote: an MLLM with interacting Deliberate and Gaze modules
Chain-of-Thought reasoning paired with visual annotation throughout multi-turn dialogues
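The abstract describes Deliberate and Gaze as two modules that interact across dialogue turns, producing Chain-of-Thought reasoning and visual annotations respectively. As a rough illustration only (the function names, signatures, and the toy region-matching heuristic below are invented for this sketch, not the paper's implementation), the per-turn interaction might look like:

```python
# Illustrative sketch of a Deliberate/Gaze turn loop.
# All names and logic here are hypothetical, not DiagNote's actual code.

def deliberate(history, question):
    """Hypothetical Chain-of-Thought step: produce a reasoning trace
    conditioned on the dialogue history and the current question."""
    return f"reasoning over {len(history)} prior turns for: {question}"

def gaze(image_regions, reasoning):
    """Hypothetical annotation step: select the image region most
    relevant to the reasoning trace (toy token-overlap heuristic)."""
    return max(image_regions,
               key=lambda r: sum(tok in r for tok in reasoning.split()))

def dialogue_turn(history, image_regions, question):
    reasoning = deliberate(history, question)   # Deliberate module
    region = gaze(image_regions, reasoning)     # Gaze module annotates
    answer = f"answer grounded in region '{region}'"
    history.append((question, answer))          # state carries across turns
    return answer
```

The point of the sketch is the data flow: the reasoning trace from one module drives the visual annotation of the other, and both persist in the dialogue history so later turns can ground references to earlier regions.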