🤖 AI Summary
Current vision-language models (VLMs) are trained and evaluated largely under a single-turn, single-image dialogue paradigm, limiting their capacity for the multi-image, multi-turn contextual reasoning required in realistic human-AI interaction. To address this, we introduce MMCR-310k, the largest multi-image, multi-turn visual instruction-tuning dataset to date, together with its companion diagnostic benchmark, MMCR-Bench. We formally define a "multi-image, multi-turn, topic-focused" dialogue paradigm spanning eight domains and forty fine-grained sub-topics. Methodologically, our approach combines multi-image alignment instruction tuning, cross-turn context modeling, and domain-aware prompt engineering. Experiments show that models fine-tuned on MMCR-310k achieve a 5.2% absolute gain in contextual accuracy on MMCR-Bench and an average +1.17% improvement across established multimodal benchmarks, including AI2D, MMMU, and MMVet, advancing multimodal contextual reasoning capabilities.
📝 Abstract
Compared to single-turn dialogue, multi-turn dialogue involving multiple images better matches the needs of real-world human-AI interaction. As training data, it also provides richer contextual reasoning signals, thereby guiding models toward better performance. However, existing vision-language models (VLMs) rely primarily on single-turn dialogue for both training and evaluation. In this paper, guided by the characteristics of human dialogue, such as focused topics and concise, clear content, we present MMCR (Multimodal Multi-turn Contextual Reasoning), a novel dataset comprising: (1) MMCR-310k -- the largest multi-image, multi-turn instruction-tuning dataset, with 310K contextual dialogues, each covering 1-4 images over 4 or 8 dialogue turns; and (2) MMCR-Bench -- a diagnostic benchmark whose dialogues span 8 domains (Humanities, Natural Science, Education, etc.) and 40 sub-topics. Extensive evaluations demonstrate that models fine-tuned on MMCR-310k achieve 5.2% higher contextual accuracy on MMCR-Bench, while showing consistent improvements on existing benchmarks (+1.1% on AI2D, +1.2% on MMMU and MMVet). The MMCR dataset and our prompt engineering will be released publicly.