🤖 AI Summary
Current vision-language models (VLMs) commonly produce lengthy reasoning chains, whether induced by explicit chain-of-thought prompting or by rule-based reinforcement learning rewards, resulting in high computational overhead and inefficient resource utilization. To address this, we propose DualMindVLM, the first VLM to incorporate a dual-mode "fast/slow thinking" mechanism, enabling dynamic selection of inference paths based on task difficulty: short outputs for simple tasks (fast thinking) and extended reasoning for complex ones (slow thinking). Our method implicitly encodes the thinking mode via output length and employs a two-stage training strategy: (1) supervised labeling that maps output length to task difficulty, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) to optimize the adaptive policy. Experiments demonstrate that DualMindVLM achieves state-of-the-art visual reasoning performance while significantly improving token efficiency, reducing average inference tokens by 42% and thereby enabling adaptive, cognitively efficient allocation of computational resources.
📝 Abstract
When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to allocate cognitive resources efficiently, enabling quick decisions for straightforward issues while reserving deeper analysis for more intricate challenges. However, existing reasoning-oriented vision-language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based reinforcement learning (RL) rewards, mainly pursue lengthy, detailed reasoning chains, which often incur excessive computational costs. In this work, we propose a simple RL approach that enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages. In the first stage, we label data as requiring either fast or slow thinking based on the model's output length, motivated by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions. In the second stage, we train the model with Group Relative Policy Optimization (GRPO) together with the thinking-mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
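To make the two stages concrete, here is a minimal sketch of the idea: length-based mode labeling, a reward that combines correctness with agreement between response length and the assigned mode, and the group-relative advantage normalization at the core of GRPO. The threshold, the reward weight, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
import statistics

def label_thinking_mode(output_tokens, threshold=200):
    """Stage 1: label a sample 'fast' or 'slow' from the base model's
    answer length (shorter answers -> fast thinking).
    The threshold value is a hypothetical choice."""
    return "fast" if len(output_tokens) < threshold else "slow"

def mode_reward(response_tokens, is_correct, mode_label, threshold=200):
    """Stage 2: a scalar reward combining answer correctness with whether
    the response length matches the stage-1 thinking-mode label.
    The 0.5 weight is an assumption for illustration."""
    correctness = 1.0 if is_correct else 0.0
    is_short = len(response_tokens) < threshold
    mode_match = 1.0 if is_short == (mode_label == "fast") else 0.0
    return correctness + 0.5 * mode_match

def group_relative_advantages(rewards):
    """GRPO computes advantages by normalizing each reward against the
    mean and standard deviation of its group of sampled responses."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]
```

In this sketch, the policy is never told the mode explicitly at inference time; the mode-match term only shapes rewards during training, so the model learns to modulate its own output length by question difficulty.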