HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation

📅 2025-02-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address weak generalization and the high cost of real-world robotic manipulation data in open-world settings, this paper proposes a decoupled hierarchical Vision-Language-Action (VLA) model. The high-level Vision-Language Model (VLM) predicts semantically grounded, coarse 2D end-effector trajectories, while a low-level 3D-aware control policy executes fine-grained physical control guided by these trajectories. This "semantic-motion" hierarchical decoupling enables cross-domain transfer from action-free videos, hand-drawn sketches, and simulation data. By combining VLM fine-tuning with 2D-path-guided 3D control, the method significantly improves out-of-distribution performance across seven axes of generalization, including robot morphology, dynamics, visual appearance, and task semantics. Real-robot experiments show an average absolute success-rate improvement of 20% over OpenVLA, a 50% relative gain.

📝 Abstract
Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is the lack of robotic data, which is typically obtained through expensive on-robot operation. A promising remedy is to leverage cheaper, off-domain data such as action-free videos, hand-drawn sketches, or simulation data. In this work, we posit that hierarchical vision-language-action (VLA) models can be more effective at utilizing off-domain data than standard monolithic VLA models that directly finetune vision-language models (VLMs) to predict actions. In particular, we study a class of hierarchical VLA models, where the high-level VLM is finetuned to produce a coarse 2D path indicating the desired robot end-effector trajectory given an RGB image and a task description. The intermediate 2D path prediction then serves as guidance for the low-level, 3D-aware control policy capable of precise manipulation. Doing so relieves the high-level VLM of fine-grained action prediction, while reducing the low-level policy's burden of complex task-level reasoning. We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios, including differences in embodiment, dynamics, visual appearance, and task semantics. In real-robot experiments, we observe an average 20% improvement in success rate across seven different axes of generalization over OpenVLA, representing a 50% relative gain. Visual results are provided at: https://hamster-robot.github.io/
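The abstract pins down a simple interface between the two levels: the VLM maps an RGB image and a task string to pixel-space waypoints, and the low-level policy maps a 3D observation plus those waypoints to actions. Below is a minimal runnable sketch of that loop; the function names, the straight-line stand-in for the VLM, the pinhole back-projection, and the proportional controller are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def predict_path_2d(rgb: np.ndarray, task: str) -> np.ndarray:
    """Stand-in for the finetuned high-level VLM.

    The real model maps (RGB image, task description) to a coarse
    pixel-space end-effector path; here we return a fixed straight-line
    path purely so the sketch runs end to end.
    """
    h, w = rgb.shape[:2]
    start, goal = np.array([w * 0.2, h * 0.8]), np.array([w * 0.8, h * 0.3])
    ts = np.linspace(0.0, 1.0, num=8)[:, None]
    return (1 - ts) * start + ts * goal            # (8, 2) waypoints

def lift_path_to_3d(path_2d, depth, K):
    """Back-project pixel waypoints to camera-frame 3D points using a
    depth image and intrinsics K (a standard pinhole model, assumed
    here rather than taken from the paper)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    us, vs = path_2d[:, 0].astype(int), path_2d[:, 1].astype(int)
    zs = depth[vs, us]
    xs, ys = (us - cx) * zs / fx, (vs - cy) * zs / fy
    return np.stack([xs, ys, zs], axis=1)          # (8, 3) points

def low_level_action(ee_pos, path_3d, gain=1.0):
    """Toy 3D-aware controller: servo toward the nearest waypoint.
    The paper's low-level policy is learned; this proportional rule
    only illustrates where the 2D path enters the control loop."""
    target = path_3d[np.argmin(np.linalg.norm(path_3d - ee_pos, axis=1))]
    return gain * (target - ee_pos)                # 3-DoF position delta

# One step of the hierarchy on synthetic inputs.
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.full((480, 640), 0.6)                   # meters
K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])
path_2d = predict_path_2d(rgb, "put the mug on the shelf")
path_3d = lift_path_to_3d(path_2d, depth, K)
action = low_level_action(np.array([0.0, 0.0, 0.5]), path_3d)
```

Note the rates can differ: the expensive VLM call can run once per task (or at a low frequency), while the low-level controller runs at control rate against the cached path.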
Problem

Research questions and friction points this paper is trying to address.

How can hierarchical VLA models improve open-world robot manipulation?
How can cheap off-domain data (action-free videos, sketches, simulation) be used for robotic task generalization?
How much does the hierarchical design improve real-robot success rates?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical VLA models that decouple semantic reasoning from motor control
2D path guidance produced by the high-level VLM (one way such a path could condition the low-level policy is sketched below)
3D-aware low-level control policy
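One natural way to hand the VLM's 2D path to an image-based low-level policy is to rasterize the waypoints onto the observation itself, so the policy sees the guidance as pixels rather than as a separate input stream. The sketch below shows that encoding; treating an overlay as the conditioning mechanism is an assumption for illustration, not a detail confirmed by the abstract.

```python
import numpy as np

def overlay_path(rgb: np.ndarray, path_2d: np.ndarray) -> np.ndarray:
    """Rasterize a coarse 2D waypoint path onto an HxWx3 RGB observation.

    A plausible conditioning scheme (an assumption, not necessarily the
    paper's): the low-level policy consumes the annotated image, so it
    needs no task-level language understanding of its own.
    """
    img = rgb.copy()
    pts = path_2d.astype(int)
    for (u0, v0), (u1, v1) in zip(pts[:-1], pts[1:]):
        # Densely interpolate between consecutive waypoints.
        n = max(abs(u1 - u0), abs(v1 - v0), 1)
        for t in np.linspace(0.0, 1.0, n + 1):
            u = int(round(u0 + t * (u1 - u0)))
            v = int(round(v0 + t * (v1 - v0)))
            img[v, u] = (255, 0, 0)                # red path pixel
    return img
```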
🔎 Similar Papers
No similar papers found.
Authors
Yi Li (NVIDIA, University of Washington)
Yuquan Deng (University of Washington, AI2)
Jesse Zhang (NVIDIA, University of Southern California)
Joel Jang (NVIDIA)
Marius Memmel (University of Washington)
Caelan Reed Garrett (NVIDIA)
Fabio Ramos (University of Sydney and NVIDIA)
Dieter Fox (University of Washington and AI2)
Anqi Li (NVIDIA)
Abhishek Gupta (NVIDIA, University of Washington)
Ankit Goyal (NVIDIA)