🤖 AI Summary
This work addresses the limitations of existing cross-modal chain-of-thought methods, which often over-rely on a single coarse-grained image region and suffer from semantic discontinuities across reasoning steps in complex visual reasoning tasks. To overcome these issues, we propose CoCoT, a novel framework that introduces a dynamic multi-region visual focusing mechanism coupled with relation-aware reasoning to enable collaborative multi-region integration and construct coherent cross-modal reasoning chains. Additionally, we curate CoCoT-70K, a high-quality dataset comprising 74,691 samples. Extensive experiments demonstrate that CoCoT achieves substantial performance gains across six challenging benchmarks, improving average accuracy by 15.4% on LLaVA-1.5 and by 4.0% on Qwen2-VL.
📝 Abstract
Multi-modal reasoning requires the seamless integration of visual and linguistic cues, yet existing Chain-of-Thought methods suffer from two critical limitations in cross-modal scenarios: (1) over-reliance on a single coarse-grained image region, and (2) semantic fragmentation between successive reasoning steps. To address these issues, we propose the CoCoT (Collaborative Cross-modal Thought) framework, built upon two key innovations: a) Dynamic Multi-Region Grounding, which adaptively detects the image regions most relevant to the question, and b) Relation-Aware Reasoning, which enables multi-region collaboration by iteratively aligning visual cues to form a coherent and logical chain of thought. Through this approach, we construct the CoCoT-70K dataset, comprising 74,691 high-quality samples with multi-region annotations and structured reasoning chains. Extensive experiments demonstrate that CoCoT significantly enhances complex visual reasoning, achieving average accuracy improvements of 15.4% on LLaVA-1.5 and 4.0% on Qwen2-VL across six challenging benchmarks. The data and code are available at: https://github.com/deer-echo/CoCoT.