Hume: Introducing System-2 Thinking in Visual-Language-Action Model

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Vision-Language-Action (VLA) models lack human-like deliberative reasoning and struggle with complex, physically grounded tasks. To address this, we propose Hume, a dual-system VLA model featuring a novel value-guided System-2 deliberative module that performs multi-candidate action evaluation and global planning, coupled with a lightweight System-1 reactive execution module that enables fine-grained control via asynchronous cascaded action denoising. These two modules operate on a unified vision-language-action joint representation yet remain decoupled and synergistic, supporting temporal separation and dynamic scheduling of reasoning and execution. Evaluated across multiple simulation benchmarks and real-world robotic platforms, Hume significantly outperforms state-of-the-art methods, demonstrating superior robustness, generalization, and cross-task transferability—particularly in dexterous manipulation tasks.

📝 Abstract
Humans practice slow thinking before performing actual actions when handling complex tasks in the physical world. This thinking paradigm has recently achieved remarkable advances in boosting Large Language Models (LLMs) to solve complex tasks in digital domains. However, the potential of slow thinking remains largely unexplored for robotic foundation models interacting with the physical world. In this work, we propose Hume: a dual-system Vision-Language-Action (VLA) model with value-guided System-2 thinking and cascaded action denoising, exploring human-like thinking capabilities of Vision-Language-Action models for dexterous robot control. System 2 of Hume implements value-guided thinking by extending a Vision-Language-Action model backbone with a novel value-query head that estimates the state-action value of predicted actions. Value-guided thinking is conducted by repeatedly sampling multiple action candidates and selecting one according to its state-action value. System 1 of Hume is a lightweight reactive visuomotor policy that takes the action selected by System 2 and performs cascaded action denoising for dexterous robot control. At deployment time, System 2 performs value-guided thinking at a low frequency while System 1 asynchronously receives the action candidate selected by System 2 and predicts fluid actions in real time. We show that Hume outperforms existing state-of-the-art Vision-Language-Action models across multiple simulation benchmarks and real-robot deployments.
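The value-guided thinking loop described in the abstract (repeat-sample candidates, score each with the value-query head, keep the best) can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: `sample_action_candidates` and `value_query` are hypothetical placeholders for Hume's policy head and value-query head.

```python
import random

random.seed(0)

HORIZON, ACTION_DIM = 4, 7  # example action-chunk shape, not from the paper

def sample_action_candidates(observation, n_candidates=8):
    # Stand-in for the System-2 VLA backbone: in Hume, candidate action
    # chunks come from the policy conditioned on vision-language inputs.
    return [[[random.gauss(0, 1) for _ in range(ACTION_DIM)]
             for _ in range(HORIZON)] for _ in range(n_candidates)]

def value_query(observation, chunk):
    # Stand-in for the value-query head estimating the state-action value
    # Q(s, a). Dummy scoring here: prefer low-magnitude (smooth) chunks.
    return -sum(a * a for step in chunk for a in step)

def system2_select(observation, n_candidates=8):
    # Value-guided thinking: repeat-sample, score, pick the best candidate.
    candidates = sample_action_candidates(observation, n_candidates)
    return max(candidates, key=lambda c: value_query(observation, c))

obs = {"instruction": "pick up the cup"}
chunk = system2_select(obs)
print(len(chunk), len(chunk[0]))  # 4 7: one selected chunk for System 1
```

The selected chunk is then handed to System 1, which refines it into fluid real-time actions via cascaded denoising.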
Problem

Research questions and friction points this paper is trying to address.

Exploring slow thinking in robotic foundation models for physical world tasks
Developing a dual-system VLA model with value-guided System-2 thinking
Enhancing dexterous robot control via cascaded action denoising
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-system Vision-Language-Action model with System-2 thinking
Value-guided thinking via value-query head
Cascaded action denoising for robot control
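The temporal decoupling of the two systems (slow deliberation, fast reaction) can be sketched as a simple dual-frequency control loop. All function bodies and the period constant are placeholders for illustration, not Hume's actual scheduling or denoising code.

```python
SYSTEM2_PERIOD = 10  # System 2 "thinks" every 10 control ticks (assumed value)

def system2_think(observation):
    # Placeholder for value-guided selection: returns a coarse action chunk.
    return [float(observation)] * 4

def system1_denoise(coarse_chunk, observation):
    # Placeholder for cascaded denoising: refine the latest System-2 chunk
    # with the current observation into a fine-grained action.
    return [a + 0.1 * observation for a in coarse_chunk]

latest_chunk = None
executed = []
for tick in range(30):
    observation = tick                       # stand-in for a fresh camera frame
    if tick % SYSTEM2_PERIOD == 0:           # slow, deliberative path
        latest_chunk = system2_think(observation)
    action = system1_denoise(latest_chunk, observation)  # fast, reactive path
    executed.append(action[0])

print(len(executed))  # 30
```

The key point the sketch illustrates is that System 1 acts on every tick using the most recent System-2 selection, so execution never blocks on deliberation.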
👥 Authors
Haoming Song (Shanghai Jiao Tong University, Shanghai AI Laboratory)
Delin Qu (Fudan University)
Yuanqi Yao (INSAIT)
Qizhi Chen (Zhejiang University)
Qi Lv (Shanghai AI Laboratory)
Yiwen Tang (Shanghai AI Laboratory)
Modi Shi (Beihang University)
Guanghui Ren (AgiBot)
Maoqing Yao (Google)
Bin Zhao (Shanghai AI Laboratory)
Dong Wang (Shanghai AI Laboratory)
Xuelong Li (Shanghai AI Laboratory)