Fast-Slow Thinking for Large Vision-Language Model Reasoning

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive "over-reasoning": they generate unnecessarily lengthy reasoning chains regardless of question difficulty, degrading the trade-off between efficiency and accuracy. To address this, we propose Fast-Slow Thinking (FAST), the first dynamic dual-mode reasoning framework for LVLMs that adaptively modulates reasoning depth based on problem characteristics. FAST uses model-driven problem representations to estimate difficulty, coupled with adaptive thinking rewards and difficulty-aware KL regularization for fine-grained control over reasoning depth. The method combines joint analysis of response length and data distribution, GRPO-based reinforcement learning, and dynamic reward modeling. Across seven visual reasoning benchmarks, FAST achieves state-of-the-art performance, with an average accuracy gain of over 10% while cutting inference token consumption by 32.7–67.3%.
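The summary above mentions an adaptive thinking reward that modulates reasoning depth by question difficulty. The paper's exact formula is not reproduced here, so the following is a minimal hypothetical sketch of the idea: a correct answer earns a fixed reward, and a length penalty shrinks as estimated difficulty grows, so easy questions favor short chains while hard ones tolerate long ones. The function name and the linear penalty shape are illustrative assumptions, not the paper's definition.

```python
def adaptive_thinking_reward(correct: bool, n_tokens: int,
                             difficulty: float, max_tokens: int = 2048) -> float:
    """Hypothetical adaptive thinking reward.

    difficulty in [0, 1]: 0 = easy (short answers preferred),
    1 = hard (long reasoning chains tolerated).
    """
    accuracy_reward = 1.0 if correct else 0.0
    # Easy questions get a strong penalty on long chains; hard ones almost none.
    length_penalty = (1.0 - difficulty) * (n_tokens / max_tokens)
    return accuracy_reward - length_penalty
```

Under this sketch, a correct but verbose answer to an easy question is rewarded less than a correct concise one, which is the pressure that discourages over-reasoning.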

📝 Abstract
Recent advances in large vision-language models (LVLMs) have revealed an overthinking phenomenon, where models generate verbose reasoning across all tasks regardless of question difficulty. To address this issue, we present FAST, a novel Fast-Slow Thinking framework that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. We develop FAST-GRPO with three components: model-based metrics for question characterization, an adaptive thinking reward mechanism, and difficulty-aware KL regularization. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7–67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
Problem

Research questions and friction points this paper is trying to address.

Addresses overthinking in large vision-language models
Adapts reasoning depth based on question characteristics
Balances reasoning accuracy and token efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic reasoning depth adaptation
Adaptive thinking reward mechanism
Difficulty-aware KL regularization
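The difficulty-aware KL regularization listed above can be illustrated with a hypothetical per-sample penalty whose coefficient decreases with estimated difficulty: easy questions are pulled strongly toward the reference policy, while hard questions get more room to explore longer reasoning. The direction of the scaling and the use of a k3-style KL estimator are assumptions for illustration, not the paper's exact formulation.

```python
import math

def difficulty_aware_kl_penalty(logp_policy, logp_ref, difficulty, beta_max=0.1):
    """Hypothetical difficulty-aware KL penalty for a GRPO-style objective.

    logp_policy, logp_ref: per-token log-probabilities of the sampled
    response under the current policy and the frozen reference model.
    difficulty in [0, 1]: larger values weaken the KL pull, leaving
    harder questions freer to drift from the reference policy.
    """
    beta = beta_max * (1.0 - difficulty)  # easy -> strong pull, hard -> weak
    # k3 KL estimator: exp(r) - r - 1 with r = logp_ref - logp_policy,
    # non-negative and unbiased for KL(policy || ref).
    per_token = [math.exp(lr - lp) - (lr - lp) - 1.0
                 for lp, lr in zip(logp_policy, logp_ref)]
    return beta * sum(per_token) / len(per_token)
```

Subtracting this penalty from the reward leaves the standard fixed-coefficient KL term as the special case difficulty = 0.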
Authors
Wenyi Xiao, Zhejiang University
Leilei Gan, Zhejiang University
Weilong Dai, Alibaba Group
Wanggui He, Alibaba Group
Ziwei Huang, Zhejiang University
Haoyuan Li, Alibaba Group
Fangxun Shu, Bytedance
Zhelun Yu, Alibaba Group
Peng Zhang, Alibaba Group
Hao Jiang, Alibaba Group
Fei Wu, Zhejiang University