Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

📅 2025-11-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how chain-of-thought (CoT) design influences the generalization of vision-language models (VLMs) on vision-centric reasoning tasks. Using a controlled maze-solving benchmark in which all intermediate reasoning steps can be generated automatically, the study compares three CoT formats on Qwen2.5-VL-7B under a supervised fine-tuning (SFT) then reinforcement learning (RL) pipeline: Language CoT, Grounding CoT (spatial coordinate trajectories), and Visual CoT (image manipulations). Visual and longer CoT mainly accelerate convergence without raising the final performance ceiling, concise CoT containing only essential grounding steps outperforms longer traces, and CoT that retains only the minimal grounding results generalizes best across maze sizes. These findings, described as a "short is long" effect and further validated on other vision-centric tasks, challenge the assumption that longer reasoning chains inherently yield better performance and offer practical guidance for building more generalizable SFT datasets for visual reasoning.

📝 Abstract

We study how different Chain-of-Thought (CoT) designs affect the acquisition of the generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as "think with image", has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To systematically evaluate this, we focus on a controlled maze-solving benchmark where reasoning rules are fully visual, difficulty can be tuned by grid size, and all the intermediate steps can be automatically generated. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and, strikingly, CoT retaining only the minimal grounding results generalizes best across different maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a "short is long" effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning.

Problem

Research questions and friction points this paper is trying to address.

Evaluates Chain-of-Thought designs for visual reasoning generalization

Compares language, grounding, and visual CoT formats in maze-solving

Finds that concise, essential grounding CoT generalizes best across maze sizes, with supporting results on other vision-centric tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled maze-solving benchmark for systematic CoT evaluation

Concise grounding CoT outperforms longer visual reasoning traces

Minimal grounding steps generalize best across varying task difficulties
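
To make the contrast between CoT formats concrete, the sketch below solves a small grid maze with breadth-first search and renders the same solution path in two ways: a compact coordinate trajectory and a longer sentence-per-step description. This is only an illustration under stated assumptions. The grid encoding (0 for free, 1 for wall), the function names `solve_maze`, `grounding_trace`, and `language_trace`, and both output formats are hypothetical and are not taken from the paper's dataset or code.

```python
from collections import deque


def solve_maze(grid, start, goal):
    """Return the shortest path from start to goal as a list of (row, col)
    cells, or None if the goal is unreachable. Assumes grid[r][c] == 1 is a wall."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        r, c = cell
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    if goal not in parent:
        return None
    # Walk back from the goal to the start, then reverse.
    path = []
    cell = goal
    while cell is not None:
        path.append(cell)
        cell = parent[cell]
    return path[::-1]


def grounding_trace(path):
    """Hypothetical minimal format: only the visited coordinates."""
    return " -> ".join(f"({r},{c})" for r, c in path)


def language_trace(path):
    """Hypothetical verbose format: one descriptive sentence per move."""
    names = {(-1, 0): "up", (1, 0): "down", (0, -1): "left", (0, 1): "right"}
    steps = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        direction = names[(r1 - r0, c1 - c0)]
        steps.append(f"From ({r0},{c0}), move {direction} to reach ({r1},{c1}).")
    return " ".join(steps)


if __name__ == "__main__":
    maze = [
        [0, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
    ]
    path = solve_maze(maze, start=(0, 0), goal=(0, 2))
    print(grounding_trace(path))
    print(language_trace(path))
```

Because the solver is deterministic, traces of either style can be produced for any grid size, which is the property that lets a maze benchmark tune difficulty and generate intermediate steps automatically.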
