SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two weaknesses of current vision-language models: the tight coupling of perception and reasoning, which lets minor perceptual errors during inference cascade into severe reasoning failures, and the reliance on costly reinforcement learning with handcrafted rewards. Inspired by the brain’s sensory-to-cognitive processing architecture, the authors propose SPARC, a framework that explicitly decouples perception from reasoning. SPARC first performs a low-resolution global search to identify image regions relevant to the query, then applies high-resolution local processing to those regions for reasoning. This two-stage modular design enables asymmetric compute allocation, selective optimization of the weaker stage, and context compression, improving robustness and efficiency under distribution shift. Experiments show that SPARC improves Qwen3VL-4B’s accuracy by 6.7 percentage points on the V* VQA benchmark and outperforms “thinking with images” by 4.6 percentage points on out-of-distribution tasks while using a 200× smaller token budget.

📝 Abstract
Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200$\times$ lower token budget.
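The two-stage pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration under assumed interfaces, not the authors' implementation: `perception_search`, `reasoning_stage`, and the fixed region it returns are hypothetical stand-ins for VLM calls, and the image is a plain nested list.

```python
# Sketch of SPARC-style two-stage inference (illustrative stubs, not the paper's code).
from dataclasses import dataclass

@dataclass
class Region:
    x: int
    y: int
    w: int
    h: int

def downsample(image, factor):
    """Stage-1 input: a coarse low-resolution view, so the global
    search consumes far fewer visual tokens."""
    return [row[::factor] for row in image[::factor]]

def perception_search(low_res_image, question):
    """Stage 1 (hypothetical): global search over the low-res image,
    returning question-relevant regions. A real VLM would emit boxes;
    here we stub in one fixed region."""
    return [Region(x=0, y=0, w=2, h=2)]

def crop(image, region):
    """Extract a full-resolution crop for the selected region."""
    return [row[region.x:region.x + region.w]
            for row in image[region.y:region.y + region.h]]

def reasoning_stage(crops, question):
    """Stage 2 (hypothetical): reasoning conditioned only on the
    selected high-resolution crops, keeping the context compressed."""
    return f"answer conditioned on {len(crops)} region(s)"

def sparc_infer(image, question, factor=4):
    low_res = downsample(image, factor)
    regions = perception_search(low_res, question)
    # Map low-res coordinates back to the full-resolution image, and
    # spend high-resolution tokens only on the selected regions.
    crops = [crop(image, Region(r.x * factor, r.y * factor,
                                r.w * factor, r.h * factor))
             for r in regions]
    return reasoning_stage(crops, question)

image = [[(i, j) for j in range(16)] for i in range(16)]
print(sparc_infer(image, "what is in the top-left corner?"))
```

The point of the split is that each stage can be scaled or optimized independently: the search stage sees only the cheap downsampled view, while the reasoning stage never receives the full image, only the crops.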
Problem

Research questions and friction points this paper is trying to address.

test-time scaling
vision-language models
perception-reasoning entanglement
visual reasoning
token efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling
perception-reasoning decoupling
modular VLMs
visual grounding
token-efficient inference