CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-image understanding faces challenges in cross-image visual comparison and in retaining key conceptual knowledge. Method: This paper proposes a vision-dominated, human-like slow-thinking framework. It introduces the first multimodal interleaved reasoning chain, supervised by visual region tokens, integrates a test-time memory-augmented module to enable interpretable, parameter-efficient multi-step cross-image reasoning, and constructs the first benchmark dataset specifically designed for slow-thinking multi-image reasoning. Contributions/Results: The work breaks away from the prevailing text-dominant paradigm, empirically validating the efficacy of vision-driven slow thinking. The reasoning process features explicit region-level alignment and dynamic concept memory. On the newly established benchmark, the framework achieves significant improvements in both cross-image reasoning accuracy and interpretability, establishing a novel paradigm for collaborative multi-image reasoning.
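
No code accompanies this summary; as a rough sketch, the interleaved multimodal reasoning chain described above might be represented as alternating textual steps and the visual regions that ground them, with region tokens available as supervision targets during training. All class and field names below are hypothetical illustrations, not from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class RegionRef:
    """A visual region cited by a reasoning step (hypothetical schema)."""
    image_id: int  # index of the input image the region comes from
    bbox: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)


@dataclass
class ReasoningStep:
    """One step of an interleaved multimodal chain: text plus grounding regions."""
    text: str  # textual rationale for this step
    regions: list[RegionRef] = field(default_factory=list)  # regions grounding the text


# A toy cross-image comparison chain: each step interleaves text with
# the image regions it reasons over, so region tokens can supervise training.
chain = [
    ReasoningStep("Locate the clock tower in the first photo.",
                  [RegionRef(0, (0.40, 0.10, 0.65, 0.55))]),
    ReasoningStep("Find the corresponding tower in the second photo.",
                  [RegionRef(1, (0.30, 0.05, 0.60, 0.50))]),
    ReasoningStep("Compare the two regions: the second tower has a spire."),
]
```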

📝 Abstract
While previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning processes. By contrast, when humans engage in sophisticated multi-image analysis, they typically perform two complementary cognitive operations: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Our approach incorporates two key innovations: (1) the construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals; this mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. (2) The introduction of a test-time memory augmentation module that expands the model's reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model.
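
For concreteness, below is a minimal sketch of how a test-time key-value concept memory could work, assuming attention-based reads and training-free writes of region features. This is one plausible reading of the memory augmentation module described in the abstract, not the paper's verified design; the class and its methods are hypothetical.

```python
import torch
import torch.nn.functional as F


class ConceptMemory:
    """Training-free key-value memory: a minimal sketch of one way a
    test-time memory module could retain visual concepts (hypothetical)."""

    def __init__(self, dim: int):
        self.keys = torch.empty(0, dim)    # stored concept embeddings
        self.values = torch.empty(0, dim)  # associated content embeddings

    def write(self, key: torch.Tensor, value: torch.Tensor) -> None:
        """Append a critical visual concept found at an intermediate step."""
        self.keys = torch.cat([self.keys, key.unsqueeze(0)])
        self.values = torch.cat([self.values, value.unsqueeze(0)])

    def read(self, query: torch.Tensor) -> torch.Tensor:
        """Attention-weighted recall of stored concepts for the current step."""
        if self.keys.numel() == 0:
            return torch.zeros_like(query)
        scores = query @ self.keys.T / self.keys.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ self.values


# Usage: during inference, write region features as they are extracted,
# then blend recalled concepts into the next step's hidden state.
mem = ConceptMemory(dim=512)
region_feat = torch.randn(512)
mem.write(key=region_feat, value=region_feat)
step_hidden = torch.randn(512)
augmented = step_hidden + mem.read(step_hidden)  # no new trained parameters
```

Because the memory is populated at inference time rather than learned, this pattern expands reasoning capacity without adding trainable parameters, which matches the parameter-efficiency claim.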
Problem

Research questions and friction points this paper is trying to address.

Existing slow-thinking methods rely on text-based intermediate reasoning, which breaks down on complex multi-image tasks.
Cross-image visual comparison requires region-level grounding that text-only reasoning chains lack.
Models have no mechanism for retaining key visual concepts across multi-step inference without adding parameters.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved multimodal multi-step reasoning chains
Test-time memory augmentation module
Novel multi-image slow-thinking dataset
👥 Authors
Guanghao Zhang (Alibaba Group)
Tao Zhong (Alibaba Group)
Yan Xia (Alibaba Group, Zhejiang University)
Zhelun Yu (Alibaba Group)
Haoyuan Li (Alibaba Group)
Wanggui He (Alibaba Group)
Fangxun Shu (Bytedance)
Mushui Liu (Zhejiang University)
D. She (Alibaba Group)
Yi Wang (Alibaba Group, Zhejiang University)
Hao Jiang (Alibaba Group)