🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive “over-reasoning”: they generate unnecessarily lengthy and inefficient reasoning chains regardless of question difficulty, undermining the efficiency–accuracy trade-off. To address this, we propose Fast-Slow Thinking (FAST), the first dynamic dual-mode reasoning framework for LVLMs that adaptively modulates reasoning depth based on problem characteristics. FAST employs a model-driven problem representation to estimate difficulty, coupled with adaptive thinking rewards and difficulty-aware KL regularization for fine-grained control over reasoning depth. The method integrates joint analysis of response length and data distribution, GRPO-based reinforcement learning, and dynamic reward modeling. Evaluated across seven visual reasoning benchmarks, FAST achieves state-of-the-art performance, delivering an average accuracy gain of over 10% while reducing inference token consumption by 32.7–67.3%, thereby significantly enhancing the synergy between efficiency and accuracy.
📝 Abstract
Recent advances in large vision-language models (LVLMs) have revealed an "overthinking" phenomenon, where models generate verbose reasoning across all tasks regardless of question difficulty. To address this issue, we present FAST, a novel Fast-Slow Thinking framework that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. We develop FAST-GRPO with three components: model-based metrics for question characterization, an adaptive thinking reward mechanism, and difficulty-aware KL regularization. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7–67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
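The adaptive thinking reward and difficulty-aware KL regularization can be illustrated with a minimal sketch. This is a hypothetical rendering of the idea, not the paper's actual formulas: function names, target lengths, and the linear schedules below are all illustrative assumptions. The intuition is that easy questions should be rewarded for short responses and regularized more strongly toward the concise base policy, while hard questions are allowed longer reasoning chains and a weaker KL constraint.

```python
def thinking_reward(response_len, difficulty, target_short=100, target_long=400):
    """Hypothetical length-shaping reward (not the paper's exact formula).

    difficulty is a model-estimated score in [0, 1]; lengths are in tokens.
    The target length interpolates between a short budget for easy questions
    and a long budget for hard ones; deviation from it is penalized.
    """
    target = target_short + difficulty * (target_long - target_short)
    return -abs(response_len - target) / target


def kl_coefficient(difficulty, beta_max=0.1):
    """Hypothetical difficulty-aware KL weight for GRPO-style training.

    Easy questions (difficulty near 0) get the full KL penalty, keeping the
    policy close to the reference model's concise behavior; hard questions
    relax it so the policy is free to reason at length.
    """
    return beta_max * (1.0 - difficulty)
```

In a GRPO-style loop, this shaping term would be added to the task (correctness) reward for each sampled response, and the per-question KL coefficient would replace the usual fixed one.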