Fast-Slow Thinking for Large Vision-Language Model Reasoning

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive "over-reasoning": they generate unnecessarily lengthy reasoning chains regardless of question difficulty, degrading the trade-off between efficiency and accuracy. To address this, we propose Fast-Slow Thinking (FAST), the first dynamic dual-mode reasoning framework for LVLMs that adaptively modulates reasoning depth based on problem characteristics. FAST uses model-driven problem representations to estimate difficulty, coupled with adaptive thinking rewards and difficulty-aware KL regularization for fine-grained control over reasoning depth. The method combines joint analysis of response length and data distribution, GRPO-based reinforcement learning, and dynamic reward modeling. Across seven visual reasoning benchmarks, FAST achieves state-of-the-art performance, with an average accuracy gain of over 10% while cutting inference token consumption by 32.7–67.3%.
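The summary above mentions an adaptive thinking reward that modulates reasoning depth by question difficulty. The paper's exact formula is not reproduced here, so the following is a minimal hypothetical sketch of the idea: a correct answer earns a fixed reward, and a length penalty shrinks as estimated difficulty grows, so easy questions favor short chains while hard ones tolerate long ones. The function name and the linear penalty shape are illustrative assumptions, not the paper's definition.

```python
def adaptive_thinking_reward(correct: bool, n_tokens: int,
                             difficulty: float, max_tokens: int = 2048) -> float:
    """Hypothetical adaptive thinking reward.

    difficulty in [0, 1]: 0 = easy (short answers preferred),
    1 = hard (long reasoning chains tolerated).
    """
    accuracy_reward = 1.0 if correct else 0.0
    # Easy questions get a strong penalty on long chains; hard ones almost none.
    length_penalty = (1.0 - difficulty) * (n_tokens / max_tokens)
    return accuracy_reward - length_penalty
```

Under this sketch, a correct but verbose answer to an easy question is rewarded less than a correct concise one, which is the pressure that discourages over-reasoning.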

📝 Abstract
Recent advances in large vision-language models (LVLMs) have revealed an overthinking phenomenon, where models generate verbose reasoning across all tasks regardless of question difficulty. To address this issue, we present FAST, a novel Fast-Slow Thinking framework that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. We develop FAST-GRPO with three components: model-based metrics for question characterization, an adaptive thinking reward mechanism, and difficulty-aware KL regularization. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7–67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
Problem

Research questions and friction points this paper is trying to address.

Addresses overthinking in large vision-language models
Adapts reasoning depth based on question characteristics
Balances reasoning accuracy and token efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic reasoning depth adaptation
Adaptive thinking reward mechanism
Difficulty-aware KL regularization
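The difficulty-aware KL regularization listed above can be illustrated with a hypothetical per-sample penalty whose coefficient decreases with estimated difficulty: easy questions are pulled strongly toward the reference policy, while hard questions get more room to explore longer reasoning. The direction of the scaling and the use of a k3-style KL estimator are assumptions for illustration, not the paper's exact formulation.

```python
import math

def difficulty_aware_kl_penalty(logp_policy, logp_ref, difficulty, beta_max=0.1):
    """Hypothetical difficulty-aware KL penalty for a GRPO-style objective.

    logp_policy, logp_ref: per-token log-probabilities of the sampled
    response under the current policy and the frozen reference model.
    difficulty in [0, 1]: larger values weaken the KL pull, leaving
    harder questions freer to drift from the reference policy.
    """
    beta = beta_max * (1.0 - difficulty)  # easy -> strong pull, hard -> weak
    # k3 KL estimator: exp(r) - r - 1 with r = logp_ref - logp_policy,
    # non-negative and unbiased for KL(policy || ref).
    per_token = [math.exp(lr - lp) - (lr - lp) - 1.0
                 for lp, lr in zip(logp_policy, logp_ref)]
    return beta * sum(per_token) / len(per_token)
```

Subtracting this penalty from the reward leaves the standard fixed-coefficient KL term as the special case difficulty = 0.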
Authors
Wenyi Xiao, Zhejiang University
Leilei Gan, Zhejiang University
Weilong Dai, Alibaba Group
Wanggui He, Alibaba Group
Ziwei Huang, Zhejiang University
Haoyuan Li, Alibaba Group
Fangxun Shu, Bytedance
Zhelun Yu, Alibaba Group
Peng Zhang, Alibaba Group
Hao Jiang, Alibaba Group
Fei Wu, Zhejiang University