Watch Wider and Think Deeper: Collaborative Cross-modal Chain-of-Thought for Complex Visual Reasoning

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing cross-modal chain-of-thought methods, which often over-rely on a single coarse-grained image region and suffer from semantic discontinuities across reasoning steps in complex visual reasoning tasks. To overcome these issues, we propose CoCoT, a novel framework that introduces a dynamic multi-region visual focusing mechanism coupled with relation-aware reasoning to enable collaborative multi-region integration and construct coherent cross-modal reasoning chains. Additionally, we curate CoCoT-70K, a high-quality dataset comprising 74,691 samples. Extensive experiments demonstrate that CoCoT achieves substantial performance gains across six challenging benchmarks, improving average accuracy by 15.4% on LLaVA-1.5 and by 4.0% on Qwen2-VL.

📝 Abstract
Multi-modal reasoning requires the seamless integration of visual and linguistic cues, yet existing Chain-of-Thought methods suffer from two critical limitations in cross-modal scenarios: (1) over-reliance on single coarse-grained image regions, and (2) semantic fragmentation between successive reasoning steps. To address these issues, we propose the CoCoT (Collaborative Cross-modal Thought) framework, built upon two key innovations: a) Dynamic Multi-Region Grounding to adaptively detect the most relevant image regions based on the question, and b) Relation-Aware Reasoning to enable multi-region collaboration by iteratively aligning visual cues to form a coherent and logical chain of thought. Through this approach, we construct the CoCoT-70K dataset, comprising 74,691 high-quality samples with multi-region annotations and structured reasoning chains. Extensive experiments demonstrate that CoCoT significantly enhances complex visual reasoning, achieving an average accuracy improvement of 15.4% on LLaVA-1.5 and 4.0% on Qwen2-VL across six challenging benchmarks. The data and code are available at: https://github.com/deer-echo/CoCoT.
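The two stages described in the abstract can be sketched at a high level: grounding adaptively keeps every question-relevant region rather than a single coarse crop, then reasoning links the kept regions into one coherent chain. The sketch below is a minimal toy illustration of that control flow only; the `Region` structure, the relevance threshold, and the string-based chain are all assumptions for illustration, not the paper's actual implementation.

```python
# Toy sketch of the two-stage CoCoT pipeline described in the abstract.
# All data structures and scores here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    relevance: float  # question-conditioned relevance score in [0, 1]

def dynamic_multi_region_grounding(regions, threshold=0.5):
    """Adaptively keep every region whose relevance to the question
    clears the threshold, instead of a single coarse-grained crop."""
    selected = [r for r in regions if r.relevance >= threshold]
    # Fall back to the single best region if nothing clears the bar.
    return selected or [max(regions, key=lambda r: r.relevance)]

def relation_aware_reasoning(selected):
    """Iteratively link the selected regions into one coherent chain,
    visiting them in order of decreasing relevance and relating each
    new region to the previous one."""
    ordered = sorted(selected, key=lambda r: r.relevance, reverse=True)
    steps = []
    for prev, cur in zip([None] + ordered[:-1], ordered):
        if prev is None:
            steps.append(f"focus on {cur.name}")
        else:
            steps.append(f"relate {prev.name} to {cur.name}")
    return " -> ".join(steps)

regions = [Region("dog", 0.9), Region("leash", 0.7), Region("sky", 0.1)]
chain = relation_aware_reasoning(dynamic_multi_region_grounding(regions))
print(chain)  # focus on dog -> relate dog to leash
```

In the actual framework both stages would be driven by a multi-modal LLM; the point of the sketch is only the data flow: multi-region selection feeds an iterative, relation-linking reasoning loop.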
Problem

Research questions and friction points this paper is trying to address.

cross-modal reasoning
visual-linguistic integration
chain-of-thought
multi-region grounding
semantic fragmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Multi-Region Grounding
Relation-Aware Reasoning
Cross-modal Chain-of-Thought
Multi-modal Reasoning
Visual-Linguistic Alignment
Wenting Lu
Fujian Normal University

Didi Zhu
Imperial College London
Multi-Modal LLMs · Out of Distribution Generalization

Tao Shen
Zhejiang University
Distributed Systems · Large-scale Optimization · Federated Learning

Donglin Zhu
Zhejiang Normal University

Ayong Ye
Fujian Normal University

Chao Wu
Zhejiang University