Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) excel at linguistic reasoning but fall short in dense visual perception, particularly in understanding spatial relations, geometric structure, and fine-grained layout, because they lack effective mechanisms for modeling dense visual information across spatial dimensions. Method: We propose Chain-of-Visual-Thought (CoVT), the first framework to distill lightweight, expert-derived visual knowledge into roughly 20 continuous visual tokens in latent space, enabling interpretable and selectively decodable visual reasoning. CoVT trains the VLM to autoregressively predict these visual tokens so that they reconstruct dense supervision signals, including depth, segmentation, edges, and DINO features, while remaining compatible with mainstream VLM architectures such as Qwen2.5-VL and LLaVA. Contribution/Results: Evaluated on more than ten perception benchmarks (e.g., CV-Bench, MMVP), CoVT delivers consistent average improvements of 3–16%, substantially strengthening models' spatial and geometric reasoning.

📝 Abstract
Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with tasks that require dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms for capturing dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (CoVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens: compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, CoVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with CoVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating CoVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16%, demonstrating that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
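The training recipe described above (emit ~20 continuous visual tokens, then reconstruct dense expert targets from them) can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the linear decoding heads, dimensions, and stand-in targets are all assumptions made for clarity, and the expert targets (depth, segmentation, edges, DINO features) would in practice come from pretrained vision models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: ~20 continuous visual tokens of width D;
# each dense expert target is flattened to a fixed-size vector here.
NUM_VISUAL_TOKENS, D = 20, 64
TARGET_DIM = 256  # e.g., a flattened low-resolution depth map

# Continuous visual tokens as the VLM would emit them (stand-in values).
visual_tokens = rng.normal(size=(NUM_VISUAL_TOKENS, D))

# One lightweight linear decoding head per expert signal (assumption:
# the paper's heads may differ; this is only illustrative).
heads = {name: rng.normal(scale=0.02, size=(NUM_VISUAL_TOKENS * D, TARGET_DIM))
         for name in ("depth", "segmentation", "edges", "dino")}

# Dense supervision targets produced offline by vision experts (stand-ins).
targets = {name: rng.normal(size=(TARGET_DIM,)) for name in heads}

def reconstruction_loss(tokens, head, target):
    """MSE between the decoded dense prediction and the expert target."""
    pred = tokens.reshape(-1) @ head
    return float(np.mean((pred - target) ** 2))

# Auxiliary loss: sum the reconstruction terms over all supervision signals;
# in training this would be added to the usual language-modeling loss.
aux_loss = sum(reconstruction_loss(visual_tokens, heads[n], targets[n])
               for n in heads)
print(f"auxiliary reconstruction loss: {aux_loss:.4f}")
```

The key design point the sketch mirrors is that the dense maps are never fed back to the language model; they only supervise the compact token bottleneck, which is why the token budget can stay near 20.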
Problem

Research questions and friction points this paper is trying to address.

VLMs struggle with dense visual perception, e.g., spatial reasoning and geometric awareness
Current VLMs lack mechanisms to capture dense visual information across spatial dimensions
Models need perceptual understanding beyond purely linguistic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous visual tokens encode rich perceptual cues
Autoregressive prediction reconstructs dense supervision signals
Reasoning in visual token space improves multimodal performance
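The inference-time behavior the bullets describe (reason directly over continuous tokens, decode dense predictions only when interpretability is wanted) can be sketched as below. All names and shapes are hypothetical stand-ins, assuming a single linear decoder for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in continuous visual tokens produced during reasoning (~20 tokens).
visual_tokens = rng.normal(size=(20, 64))

# Hypothetical decoder mapping flattened tokens to a small 16x16 dense map.
decoder = rng.normal(scale=0.02, size=(20 * 64, 16 * 16))

def reason(tokens, decode_dense=False):
    """Reason over continuous tokens; decode a dense map only on request.

    The default path skips decoding entirely, so answering stays cheap;
    decoding is an optional, purely interpretive side output.
    """
    # Toy "reasoning": a pooled summary vector feeding the language head.
    summary = tokens.mean(axis=0)
    dense = (tokens.reshape(-1) @ decoder).reshape(16, 16) if decode_dense else None
    return summary, dense

summary, dense = reason(visual_tokens)                 # fast path
_, dense_map = reason(visual_tokens, decode_dense=True)  # interpretable path
print(dense is None, dense_map.shape)
```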