🤖 AI Summary
Existing vision-language models often discretize visual inputs into text prematurely, discarding continuous signals such as geometric and spatial-layout information, and they largely process precomputed features passively, without active perception capabilities. To address these limitations, this work proposes ViThinker, a framework that introduces an active perception mechanism for the first time: the model autonomously generates query tokens that trigger the synthesis of task-relevant, expert-aligned visual features on demand, effectively internalizing visual expertise and enabling task-driven, minimally sufficient perception without external tools. Combining expert distillation, sparsity-constrained query learning, generative feature synthesis, and a two-stage curriculum training strategy, ViThinker significantly outperforms passive approaches across multiple vision-centric benchmarks, improving both perceptual grounding and reasoning accuracy.
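The expert-distillation idea in the summary can be illustrated with a toy sketch. This is not ViThinker's actual objective (the paper's loss and feature shapes are not given here); it only shows the common pattern of regressing a student model's visual features onto a frozen expert's features, so the expert's knowledge is internalized into the student's own parameters. All names below are illustrative assumptions.

```python
import numpy as np

def distillation_loss(student_feats, expert_feats):
    """Mean-squared error between student features and frozen-expert features.
    Minimizing this pulls the student's representation toward the expert's
    (a generic feature-distillation objective, not the paper's exact loss)."""
    return float(np.mean((student_feats - expert_feats) ** 2))

rng = np.random.default_rng(0)
expert = rng.standard_normal((4, 16))                   # frozen expert features
student = expert + 0.1 * rng.standard_normal((4, 16))   # partially aligned student

# A student already close to the expert incurs a much smaller loss
# than an uninitialized (all-zero) one.
aligned = distillation_loss(student, expert)
unaligned = distillation_loss(np.zeros((4, 16)), expert)
assert aligned < unaligned
```

In practice the expert is kept frozen (no gradients flow into it), so only the student's parameters move toward the expert's feature space.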
📝 Abstract
Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models due to premature visual-to-text conversion that discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, processing pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens that trigger the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training and performs generative mental simulation at inference without external tool calls. Trained with a two-stage curriculum that first distills frozen experts into model parameters and then learns task-driven querying under sparsity penalties, ViThinker discovers the minimally sufficient perception for each reasoning step. Evaluations across vision-centric benchmarks demonstrate consistent improvements, validating that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.
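The second curriculum stage described above, learning task-driven querying under sparsity penalties, can be sketched with a minimal toy objective. The abstract does not specify the loss, so the function and variable names below are assumptions: we model query selection as per-token gate scores and add an L1 penalty that rewards activating as few query tokens as the task allows (the "minimally sufficient perception" behavior).

```python
import numpy as np

def sparse_query_loss(task_loss, gate_scores, lam=0.1):
    """Total objective = task loss + lambda * L1 penalty on query-gate scores.
    The L1 term penalizes every active query token, so gradient descent
    prefers solutions that solve the task with as few queries as possible.
    (Illustrative stand-in for the paper's sparsity penalty, not its exact form.)"""
    sparsity_penalty = float(np.abs(gate_scores).sum())
    return task_loss + lam * sparsity_penalty

# Toy comparison at equal task loss: opening every gate vs. only two gates.
dense_gates = np.ones(8)                                  # activate all 8 query tokens
sparse_gates = np.array([1.0, 0, 0, 1.0, 0, 0, 0, 0])     # activate only 2

loss_dense = sparse_query_loss(task_loss=0.5, gate_scores=dense_gates)    # 0.5 + 0.1*8 = 1.3
loss_sparse = sparse_query_loss(task_loss=0.5, gate_scores=sparse_gates)  # 0.5 + 0.1*2 = 0.7
assert loss_sparse < loss_dense
```

Under such a penalty, a query configuration is only worth keeping if the extra visual detail it fetches reduces the task loss by more than its sparsity cost, which is one simple way to operationalize "minimally sufficient" perception.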