AI Summary
This work addresses the limited improvement in perceptual capabilities of vision-language models during post-training, which constrains end-to-end visual reasoning performance despite notable gains in reasoning ability. The study systematically identifies and diagnoses an optimization asymmetry between perception and reasoning in post-training, introducing a diagnostic framework that decouples their evaluation. To mitigate this imbalance, the authors propose dynamic loss reweighting for supervised fine-tuning and a perception-aware reward mechanism for reinforcement learning, both operating without additional annotations. Experiments demonstrate substantial performance gains: up to 18.2 points in supervised fine-tuning and 6.0 points under reinforcement learning on end-to-end visual reasoning tasks, with a consistent 3.2-point improvement even in the absence of ground-truth rewards.
Abstract
Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2 points. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0 points; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.
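To make the two mechanisms concrete, the following is a minimal sketch, not the authors' implementation. It assumes each chain-of-thought token is tagged as perception or reasoning, and it uses a static, segment-balanced loss as a simple stand-in for the dynamic reweighting described above. The function names, the equal `perception_share` split, and the reward weight `lam` are all hypothetical choices made for illustration.

```python
from typing import Sequence


def uniform_nll(token_nll: Sequence[float]) -> float:
    """Standard SFT objective: every token contributes equally."""
    return sum(token_nll) / len(token_nll)


def balanced_nll(
    token_nll: Sequence[float],
    is_perception: Sequence[bool],
    perception_share: float = 0.5,  # assumed split, not taken from the paper
) -> float:
    """Average the loss within each segment, then mix the two segment means.

    This removes the dependence on how many tokens each segment occupies,
    so a short perception span is no longer drowned out by a long
    reasoning span.
    """
    perc = [l for l, p in zip(token_nll, is_perception) if p]
    reas = [l for l, p in zip(token_nll, is_perception) if not p]
    if not perc or not reas:
        # Only one segment present: fall back to the uniform objective.
        return uniform_nll(token_nll)
    perc_mean = sum(perc) / len(perc)
    reas_mean = sum(reas) / len(reas)
    return perception_share * perc_mean + (1.0 - perception_share) * reas_mean


def shaped_reward(outcome_correct: bool, perception_score: float, lam: float = 0.5) -> float:
    """Outcome reward plus a weighted perception term (lam is an assumption).

    perception_score is expected in [0, 1]; it could come from ground-truth
    perception labels or from a surrogate scorer.
    """
    return float(outcome_correct) + lam * perception_score


# Two poorly fit perception tokens followed by eight well-fit reasoning tokens.
nll = [2.0, 2.0] + [0.5] * 8
mask = [True, True] + [False] * 8

print(uniform_nll(nll))         # perception holds 2 of 10 token slots
print(balanced_nll(nll, mask))  # perception holds half of the objective
print(shaped_reward(False, 1.0))  # correct perception still earns partial reward
```

In this toy example the perception tokens carry most of the error, yet under the uniform objective they occupy only a fifth of the token slots; the balanced objective gives them half. Likewise, the shaped reward separates a rollout that perceives correctly but reasons wrongly from one that fails at both, which an outcome-only reward cannot do.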