🤖 AI Summary
This work addresses "overthinking" in visual reasoning models, which often generate redundant reasoning chains even when a question does not need them. The authors propose AVR, an adaptive visual reasoning framework that adapts the reasoning path to the question. The framework decomposes visual reasoning into three functions (visual perception, logical reasoning, and answer application) and lets the model choose dynamically among three response formats: full reasoning, perception-only, or direct answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that favors the most efficient format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50%–90% while maintaining overall accuracy, especially on perception-intensive tasks.
📝 Abstract
Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains regardless of task difficulty. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50%–90% while maintaining overall accuracy, especially on perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.
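
To make the training idea concrete, the following is a minimal sketch of how a group-relative advantage could favor shorter response formats while keeping correctness as a precondition. It is not the authors' FS-GRPO implementation: the format names, the bonus values in `FORMAT_BONUS`, and the `reward` function are hypothetical choices made for illustration. Only the group normalization step (subtract the group mean, divide by the group standard deviation) follows standard Group Relative Policy Optimization.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Hypothetical bonuses (not from the paper): cheaper formats earn more.
FORMAT_BONUS = {"direct": 0.5, "perception_only": 0.25, "full": 0.0}


@dataclass
class Sample:
    fmt: str       # one of the keys in FORMAT_BONUS
    correct: bool  # whether the final answer matched the reference


def reward(sample: Sample) -> float:
    # Assumed reward shape: the efficiency bonus is granted only when the
    # answer is correct, so a short but wrong response gains nothing.
    return 1.0 + FORMAT_BONUS[sample.fmt] if sample.correct else 0.0


def group_relative_advantages(samples, eps=1e-6):
    # Standard GRPO-style normalization within one group of responses
    # sampled for the same question.
    rewards = [reward(s) for s in samples]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled responses to the same visual question.
group = [
    Sample("full", True),
    Sample("perception_only", True),
    Sample("direct", True),
    Sample("direct", False),
]
advantages = group_relative_advantages(group)
for s, a in zip(group, advantages):
    print(f"{s.fmt:16s} correct={s.correct!s:5s} advantage={a:+.3f}")
```

Under these assumed bonuses, a correct direct answer receives the largest advantage, a correct full-format response a smaller one, and an incorrect direct answer the lowest, which mirrors the stated goal of preferring efficient formats without trading away correctness.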