🤖 AI Summary
This work argues that the bottleneck of current vision-language models on visual tasks stems primarily from insufficient visual perception rather than from reasoning capability itself. The authors propose a staged training paradigm based on capability decoupling, partitioning post-training into three sequential phases: visual perception, visual reasoning, and textual reasoning. Perceptual abilities are strengthened first, through reinforcement learning on specialized datasets, and reasoning skills are refined afterwards. The resulting gains are orthogonal to traditional difficulty-based curriculum learning and can be combined with it. Experiments show that, relative to the base model, the method improves accuracy by 5.2% on WeMath and 3.7% on RealWorldQA. It also raises reasoning accuracy by 1.5% while shortening reasoning traces by 20.8%, supporting the central thesis that robust perception is the foundation of effective reasoning.
📝 Abstract
Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet we find that their performance on visual tasks is limited primarily by a lack of visual perception rather than by reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing model capabilities into three separate training stages, each with specialized training data: visual perception, visual reasoning, and textual reasoning. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs show that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and that combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, with gains over their base counterparts on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA).
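The contrast between staged and merged training described in the abstract can be pictured as a difference in how training examples are ordered. The sketch below is illustrative only and is not the authors' implementation: the stage names come from the abstract, while the function names, pool layout, and example IDs are hypothetical, and the RL optimization itself is omitted.

```python
import random
from typing import Dict, List

# Stage order as listed in the abstract; everything else is an assumption.
STAGE_ORDER = ["visual_perception", "visual_reasoning", "textual_reasoning"]


def staged_schedule(pools: Dict[str, List[str]], seed: int = 0) -> List[str]:
    """Shuffle within each capability pool, then concatenate pools in stage order."""
    rng = random.Random(seed)
    schedule: List[str] = []
    for stage in STAGE_ORDER:
        batch = list(pools[stage])  # copy so the caller's data is untouched
        rng.shuffle(batch)
        schedule.extend(batch)
    return schedule


def merged_schedule(pools: Dict[str, List[str]], seed: int = 0) -> List[str]:
    """Baseline: pool all examples together and shuffle once."""
    rng = random.Random(seed)
    merged = [example for stage in STAGE_ORDER for example in pools[stage]]
    rng.shuffle(merged)
    return merged


# Hypothetical example IDs, one pool per capability.
pools = {
    "visual_perception": ["p1", "p2", "p3"],
    "visual_reasoning": ["v1", "v2"],
    "textual_reasoning": ["t1", "t2"],
}

staged = staged_schedule(pools)
merged = merged_schedule(pools)
```

Both schedules contain the same examples. In the staged schedule all perception examples are seen before any reasoning examples, whereas the merged schedule interleaves them at random.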