🤖 AI Summary
This work addresses the limitations of existing cross-modal chain-of-thought methods, which often over-rely on a single coarse-grained image region and suffer from semantic discontinuities across reasoning steps in complex visual reasoning tasks. To overcome these issues, we propose CoCoT, a novel framework that introduces a dynamic multi-region visual focusing mechanism coupled with relation-aware reasoning to enable collaborative multi-region integration and construct coherent cross-modal reasoning chains. Additionally, we curate CoCoT-70K, a high-quality dataset comprising 74,691 samples. Extensive experiments demonstrate that CoCoT achieves substantial performance gains across six challenging benchmarks, improving average accuracy by 15.4% on LLaVA-1.5 and by 4.0% on Qwen2-VL.
📝 Abstract
Multi-modal reasoning requires the seamless integration of visual and linguistic cues, yet existing Chain-of-Thought methods suffer from two critical limitations in cross-modal scenarios: (1) over-reliance on a single coarse-grained image region, and (2) semantic fragmentation between successive reasoning steps. To address these issues, we propose the CoCoT (Collaborative Cross-modal Thought) framework, built upon two key innovations: a) Dynamic Multi-Region Grounding, which adaptively detects the image regions most relevant to the question, and b) Relation-Aware Reasoning, which enables multi-region collaboration by iteratively aligning visual cues to form a coherent and logical chain of thought. Through this approach, we construct the CoCoT-70K dataset, comprising 74,691 high-quality samples with multi-region annotations and structured reasoning chains. Extensive experiments demonstrate that CoCoT significantly enhances complex visual reasoning, achieving average accuracy improvements of 15.4% on LLaVA-1.5 and 4.0% on Qwen2-VL across six challenging benchmarks. The data and code are available at: https://github.com/deer-echo/CoCoT.