CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-image understanding faces challenges in cross-image visual comparison and in retaining key conceptual knowledge. Method: This paper proposes a vision-dominated, human-like slow-thinking framework. It introduces the first multimodal interleaved reasoning chain, supervised by visual region tokens, integrates a test-time memory-augmented module to enable interpretable, parameter-efficient multi-step cross-image reasoning, and constructs the first benchmark dataset specifically designed for slow-thinking multi-image reasoning. Contributions/Results: The work breaks away from the prevailing text-dominant paradigm, empirically validating the efficacy of vision-driven slow thinking. The reasoning process features explicit region-level alignment and dynamic concept memory. On the newly established benchmark, the framework achieves significant improvements in both cross-image reasoning accuracy and interpretability, establishing a novel paradigm for collaborative multi-image reasoning.
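
No code accompanies this summary; as a rough sketch, the interleaved multimodal reasoning chain described above might be represented as alternating textual steps and the visual regions that ground them, with region tokens available as supervision targets during training. All class and field names below are hypothetical illustrations, not from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class RegionRef:
    """A visual region cited by a reasoning step (hypothetical schema)."""
    image_id: int  # index of the input image the region comes from
    bbox: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)


@dataclass
class ReasoningStep:
    """One step of an interleaved multimodal chain: text plus grounding regions."""
    text: str  # textual rationale for this step
    regions: list[RegionRef] = field(default_factory=list)  # regions grounding the text


# A toy cross-image comparison chain: each step interleaves text with
# the image regions it reasons over, so region tokens can supervise training.
chain = [
    ReasoningStep("Locate the clock tower in the first photo.",
                  [RegionRef(0, (0.40, 0.10, 0.65, 0.55))]),
    ReasoningStep("Find the corresponding tower in the second photo.",
                  [RegionRef(1, (0.30, 0.05, 0.60, 0.50))]),
    ReasoningStep("Compare the two regions: the second tower has a spire."),
]
```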

📝 Abstract
While previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning processes. By contrast, when humans engage in sophisticated multi-image analysis, they typically perform two complementary cognitive operations: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Our approach incorporates two key innovations: (1) the construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals; this mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. (2) The introduction of a test-time memory augmentation module that expands the model's reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model.
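
For concreteness, below is a minimal sketch of how a test-time key-value concept memory could work, assuming attention-based reads and training-free writes of region features. This is one plausible reading of the memory augmentation module described in the abstract, not the paper's verified design; the class and its methods are hypothetical.

```python
import torch
import torch.nn.functional as F


class ConceptMemory:
    """Training-free key-value memory: a minimal sketch of one way a
    test-time memory module could retain visual concepts (hypothetical)."""

    def __init__(self, dim: int):
        self.keys = torch.empty(0, dim)    # stored concept embeddings
        self.values = torch.empty(0, dim)  # associated content embeddings

    def write(self, key: torch.Tensor, value: torch.Tensor) -> None:
        """Append a critical visual concept found at an intermediate step."""
        self.keys = torch.cat([self.keys, key.unsqueeze(0)])
        self.values = torch.cat([self.values, value.unsqueeze(0)])

    def read(self, query: torch.Tensor) -> torch.Tensor:
        """Attention-weighted recall of stored concepts for the current step."""
        if self.keys.numel() == 0:
            return torch.zeros_like(query)
        scores = query @ self.keys.T / self.keys.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ self.values


# Usage: during inference, write region features as they are extracted,
# then blend recalled concepts into the next step's hidden state.
mem = ConceptMemory(dim=512)
region_feat = torch.randn(512)
mem.write(key=region_feat, value=region_feat)
step_hidden = torch.randn(512)
augmented = step_hidden + mem.read(step_hidden)  # no new trained parameters
```

Because the memory is populated at inference time rather than learned, this pattern expands reasoning capacity without adding trainable parameters, which matches the parameter-efficiency claim.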
Problem

Research questions and friction points this paper is trying to address.

Existing slow-thinking methods rely on text-based intermediate reasoning, which breaks down on complex multi-image tasks.
Cross-image visual comparison requires region-level grounding that text-only reasoning chains lack.
Models have no mechanism for retaining key visual concepts across multi-step inference without adding parameters.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved multimodal multi-step reasoning chains
Test-time memory augmentation module
Novel multi-image slow-thinking dataset
👥 Authors
Guanghao Zhang (Alibaba Group)
Tao Zhong (Alibaba Group)
Yan Xia (Alibaba Group, Zhejiang University)
Zhelun Yu (Alibaba Group)
Haoyuan Li (Alibaba Group)
Wanggui He (Alibaba Group)
Fangxun Shu (Bytedance)
Mushui Liu (Zhejiang University)
D. She (Alibaba Group)
Yi Wang (Alibaba Group, Zhejiang University)
Hao Jiang (Alibaba Group)