On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

📅 2026-05-28
🤖 AI Summary
This work addresses the limited improvement in the perceptual capabilities of vision-language models during post-training, which constrains end-to-end visual reasoning even as reasoning ability improves markedly. The study systematically identifies and diagnoses an optimization asymmetry between perception and reasoning in post-training, using a diagnostic framework built on two synthetic tasks that evaluates the two abilities separately. To mitigate this imbalance, the authors propose dynamic loss reweighting for supervised fine-tuning and a perception-aware reward for reinforcement learning, both operating without additional annotations. Experiments show end-to-end visual reasoning gains of up to 18.2 points with supervised fine-tuning and up to 6.0 points with reinforcement learning, plus a 3.2-point gain from a surrogate perception reward when ground-truth perception rewards are unavailable.
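The abstract traces the SFT asymmetry to token imbalance: perception spans occupy fewer chain-of-thought tokens, so under a plain per-token mean loss they receive a smaller share of the training signal. The sketch below shows one way dynamic loss reweighting could work. It assumes every target token is already tagged as perception or reasoning; the tagging, the function names, and the `perception_share` parameter are hypothetical and are not the authors' implementation. Weights are recomputed from each sequence's own token counts, so each group receives a fixed share of the loss however many tokens it has.

```python
from typing import List, Sequence


def group_balanced_weights(
    is_perception: Sequence[bool], perception_share: float = 0.5
) -> List[float]:
    """Per-token loss weights (summing to 1) that give perception tokens a
    fixed share of the total weight, independent of how many there are.

    Hypothetical sketch: assumes every target token is tagged as
    perception (True) or reasoning (False).
    """
    if not 0.0 <= perception_share <= 1.0:
        raise ValueError("perception_share must be in [0, 1]")
    n_perc = sum(1 for p in is_perception if p)
    n_reas = len(is_perception) - n_perc
    if n_perc + n_reas == 0:
        raise ValueError("empty sequence")
    # If one group is missing, the other group takes all of the weight.
    if n_perc == 0:
        share_p, share_r = 0.0, 1.0
    elif n_reas == 0:
        share_p, share_r = 1.0, 0.0
    else:
        share_p, share_r = perception_share, 1.0 - perception_share
    # Split each group's share evenly across that group's own tokens.
    return [share_p / n_perc if p else share_r / n_reas for p in is_perception]


def weighted_nll(token_logprobs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted negative log-likelihood of the target tokens."""
    if len(token_logprobs) != len(weights):
        raise ValueError("length mismatch")
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))


# Toy sequence: 2 perception tokens followed by 8 reasoning tokens.
tags = [True] * 2 + [False] * 8
weights = group_balanced_weights(tags)
print(weights[0], weights[-1])  # prints 0.25 0.0625
```

In this toy sequence a plain mean loss would give the two perception tokens 20% of the total weight. The balanced weights give each perception token 0.25 and each reasoning token 0.0625, so each group contributes half of the loss.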
๐Ÿ“ Abstract
Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2 points. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0 points; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.
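For RL, the abstract above attributes the asymmetry to reward coupling and counters it with a perception-aware reward. The minimal sketch below illustrates both ideas. It assumes binary per-rollout labels for perception correctness, reasoning correctness, and final-answer correctness, of the kind a synthetic diagnostic task could supply. The coupling check is a plain Pearson correlation between the outcome reward and each component. The shaped reward adds a weighted perception term to the outcome reward. The `Rollout` type, the weight `lam`, and the additive form are illustrative assumptions, not the paper's definitions. `perception_score` stands in for either a ground-truth perception reward or a surrogate scorer.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Rollout:
    """Hypothetical per-rollout labels from a synthetic diagnostic task."""
    perception_ok: bool  # intermediate perception step was correct
    reasoning_ok: bool   # reasoning step was correct
    outcome_ok: bool     # final answer was correct (the outcome reward)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; NaN when either input is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def reward_coupling(rollouts: Sequence[Rollout]) -> Tuple[float, float]:
    """Correlation of the outcome reward with perception and with reasoning."""
    outcome = [float(r.outcome_ok) for r in rollouts]
    perception = [float(r.perception_ok) for r in rollouts]
    reasoning = [float(r.reasoning_ok) for r in rollouts]
    return pearson(outcome, perception), pearson(outcome, reasoning)


def perception_aware_reward(
    outcome_ok: bool, perception_score: float, lam: float = 0.5
) -> float:
    """Outcome reward plus lam times a perception score in [0, 1]."""
    return float(outcome_ok) + lam * perception_score


# Constructed toy batch in which the final answer tracks reasoning only.
batch = [
    Rollout(True, True, True),
    Rollout(False, True, True),
    Rollout(True, False, False),
    Rollout(False, False, False),
]
print(reward_coupling(batch))  # prints (0.0, 1.0)
```

Under the outcome reward alone, a rollout with correct perception but a wrong answer scores 0, exactly like one that got everything wrong. With `lam = 0.5` it scores 0.5, so perception earns credit of its own.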
Problem

Research questions and friction points this paper is trying to address.

asymmetric optimization
reasoning
perception
vision-language models
post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

asymmetric optimization
perception-reasoning disentanglement
loss reweighting
perception-aware reward
vision-language post-training