🤖 AI Summary
Current vision-language models (VLMs) commonly produce lengthy reasoning chains, whether induced by explicit chain-of-thought prompting or by rule-based reinforcement learning rewards, resulting in high computational overhead and inefficient resource utilization. To address this, we propose DualMindVLM, the first VLM to incorporate a dual-mode "fast/slow thinking" mechanism, enabling dynamic selection of inference paths based on task difficulty: short outputs for simple tasks (fast thinking) and extended reasoning for complex ones (slow thinking). Our method implicitly encodes the thinking mode via output length and employs a two-stage training strategy: (1) supervised labeling that maps output length to task difficulty, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) to optimize the adaptive policy. Experiments demonstrate that DualMindVLM achieves state-of-the-art visual reasoning performance while significantly improving token efficiency, reducing average inference tokens by 42% and thereby enabling adaptive, cognitively efficient allocation of computational resources.
📝 Abstract
When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to allocate cognitive resources efficiently, enabling quick decisions for straightforward issues while reserving deeper analysis for more intricate challenges. However, existing reasoning-oriented vision-language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based reinforcement learning (RL) rewards, mainly pursue lengthy, detailed reasoning chains, which often incur excessive computational costs. In this work, we propose a simple RL approach that enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages. In the first stage, we label data as requiring either fast or slow thinking based on the model's output length, motivated by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions. In the second stage, we train the model with Group Relative Policy Optimization (GRPO) together with the thinking-mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
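To make the two stages concrete, here is a minimal sketch of the idea: length-based mode labeling, a reward that combines correctness with agreement between response length and the assigned mode, and the group-relative advantage normalization at the core of GRPO. The threshold, the reward weight, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
import statistics

def label_thinking_mode(output_tokens, threshold=200):
    """Stage 1: label a sample 'fast' or 'slow' from the base model's
    answer length (shorter answers -> fast thinking).
    The threshold value is a hypothetical choice."""
    return "fast" if len(output_tokens) < threshold else "slow"

def mode_reward(response_tokens, is_correct, mode_label, threshold=200):
    """Stage 2: a scalar reward combining answer correctness with whether
    the response length matches the stage-1 thinking-mode label.
    The 0.5 weight is an assumption for illustration."""
    correctness = 1.0 if is_correct else 0.0
    is_short = len(response_tokens) < threshold
    mode_match = 1.0 if is_short == (mode_label == "fast") else 0.0
    return correctness + 0.5 * mode_match

def group_relative_advantages(rewards):
    """GRPO computes advantages by normalizing each reward against the
    mean and standard deviation of its group of sampled responses."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]
```

In this sketch, the policy is never told the mode explicitly at inference time; the mode-match term only shapes rewards during training, so the model learns to modulate its own output length by question difficulty.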