ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models often discretize visual inputs into text too early, discarding continuous signals such as geometry and spatial layout, and they passively process precomputed features without any active perception capability. To address this limitation, this work proposes ViThinker, a framework that introduces an active perception mechanism: the model autonomously generates query tokens that synthesize task-relevant, expert-aligned visual features on demand, internalizing visual expertise and enabling task-driven, minimally sufficient perception without external tools. Combining expert distillation, sparsity-constrained query learning, generative feature synthesis, and a two-stage curriculum training strategy, ViThinker significantly outperforms passive approaches across multiple vision-centric benchmarks, advancing both perceptual grounding and reasoning accuracy.
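The paper does not include code on this page; as a rough illustration only (not the authors' implementation), the sparsity-constrained query mechanism the summary describes can be sketched as a gating step: each candidate perceptual query gets a gate probability, queries above a threshold would trigger on-demand feature synthesis, and an L1-style penalty on the gates discourages firing more queries than the task needs. The function name, threshold, and weight below are all illustrative assumptions.

```python
import math

def sigmoid(x):
    """Logistic function mapping a raw logit to a gate probability."""
    return 1.0 / (1.0 + math.exp(-x))

def select_queries(query_logits, threshold=0.5, sparsity_weight=0.1):
    """Hypothetical sketch of sparsity-gated perceptual querying.

    Queries whose gate probability clears the threshold are 'fired'
    (in the paper's framing, these would trigger synthesis of
    expert-aligned visual features); the L1-style penalty on gate
    probabilities pushes the model toward minimal sufficient perception.
    """
    gates = [sigmoid(logit) for logit in query_logits]
    fired = [i for i, g in enumerate(gates) if g > threshold]
    sparsity_penalty = sparsity_weight * sum(gates)
    return fired, sparsity_penalty

# Example: two confident queries fire, two are suppressed.
fired, penalty = select_queries([2.0, -1.5, 0.8, -3.0])
```

In this toy run only queries 0 and 2 fire; during training the penalty term would be added to the task loss so the model learns to fire as few queries as the reasoning step requires.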

📝 Abstract
Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models due to premature visual-to-text conversion that discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, processing pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens that trigger the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training and performs generative mental simulation during inference without external tool calls. Through a two-stage curriculum, first distilling frozen experts into model parameters and then learning task-driven querying under sparsity penalties, ViThinker discovers the minimal sufficient perception for each reasoning step. Evaluations across vision-centric benchmarks demonstrate consistent improvements, validating that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.
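The two-stage curriculum in the abstract can be summarized as two loss objectives. The sketch below is a hypothetical simplification, not the paper's actual losses: stage 1 regresses the model's internal features onto a frozen vision expert's outputs (distillation), and stage 2 optimizes the task objective plus an L1 penalty on query-gate probabilities. All names and the MSE/L1 choices are assumptions for illustration.

```python
def stage1_distillation_loss(student_feats, expert_feats):
    """Stage 1 (sketch): mean squared error between the model's
    features and a frozen vision expert's features, internalizing
    the expert's capability into the model parameters."""
    n = len(student_feats)
    return sum((s - e) ** 2 for s, e in zip(student_feats, expert_feats)) / n

def stage2_loss(task_loss, gate_probs, sparsity_weight=0.1):
    """Stage 2 (sketch): task objective plus an L1 sparsity penalty
    on the query gates, steering the model toward the minimal
    sufficient perception for each reasoning step."""
    return task_loss + sparsity_weight * sum(gate_probs)
```

Separating the stages lets the querying policy in stage 2 select among features the model already knows how to synthesize, rather than learning synthesis and selection jointly.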
Problem

Research questions and friction points this paper is trying to address.

vision-language reasoning
Chain-of-Thought
active perception
visual grounding
spatial layout
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active Vision-Language Reasoning
Dynamic Perceptual Querying
Chain-of-Thought
Generative Mental Simulation
Task-Driven Querying
Weihang You
School of Computing, University of Georgia, Athens, GA, USA
Qingchan Zhu
School of Computing, University of Georgia, Athens, GA, USA
David Liu
School of Engineering and Applied Science, Princeton University, Princeton, NJ, USA
Yi Pan
University of Georgia
Brain-inspired AI, Artificial General Intelligence
Geng Yuan
University of Georgia
Efficient AI, Explainable AI, Trustworthy ML, Edge Computing, AI Applications
Hanqi Jiang
University of Georgia
Medical Image Analysis, Multi-modal Large Language Models