🤖 AI Summary
This work addresses the inefficiency and accuracy degradation of large vision-language models (LVLMs) on simple tasks, where over-reasoning often produces unnecessarily verbose responses. Whereas prior approaches overlook visual perception failure as a fundamental bottleneck, this paper proposes GPRO (Gated Perception-Reasoning Optimization), a framework that, for the first time, decouples perception failures from reasoning errors. GPRO constructs supervision signals from failure attribution and introduces a meta-reasoning controller that dynamically selects among a lightweight fast path, a slow perception path, and a slow reasoning path. Leveraging a teacher model to generate approximately 790,000 failure-attribution labels, the path-selection strategy is optimized via multi-objective reinforcement learning. Experiments show that GPRO significantly improves both accuracy and inference efficiency across five benchmarks, outperforming existing "slow thinking" methods while producing more concise responses.
📝 Abstract
Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.
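The gated routing idea in the abstract can be sketched in a few lines: at each generation step a small controller scores the three decision paths (fast, slow perception, slow reasoning) and routes to the highest-scoring one, while a cost-penalized reward captures the accuracy/compute trade-off used in the multi-objective RL objective. This is a minimal, hypothetical illustration; the names, fixed gate weights, and reward form below are assumptions, not the paper's actual architecture.

```python
import math

# Illustrative path indices (hypothetical; the paper does not specify an API).
FAST, PERCEIVE, REASON = 0, 1, 2


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


class PathController:
    """Toy meta-reasoning gate: one linear scorer per decision path.

    In GPRO this gate would be learned; here we use fixed weight
    vectors purely to make the routing mechanics concrete.
    """

    def __init__(self, weights):
        self.weights = weights  # list of 3 weight vectors, one per path

    def logits(self, features):
        return [sum(w * f for w, f in zip(wv, features)) for wv in self.weights]

    def select(self, features):
        probs = softmax(self.logits(features))
        path = max(range(3), key=lambda i: probs[i])
        return path, probs


def reward(correct, tokens_used, lam=0.01):
    """Sketch of a multi-objective RL reward: task accuracy minus a
    compute-cost penalty (lambda is an assumed trade-off coefficient)."""
    return float(correct) - lam * tokens_used


# Routing one step: features favoring the first weight vector pick the fast path.
controller = PathController([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
path, probs = controller.select([2.0, 0.1])
```

In this sketch a confident, visually simple step would produce features that score the fast path highest, while ambiguous visual evidence or low-confidence intermediate reasoning would shift probability mass to the perception or reasoning paths; training with the cost-penalized reward discourages routing every step through the expensive slow paths.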