MolmoAct2: Action Reasoning Models for Real-world Deployment

📅 2026-05-04

🤖 AI Summary
This work addresses the limitations of existing vision-language-action (VLA) models in real-world robotic deployment: frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies incur high latency, and fine-tuned success rates remain too low for dependable use. It introduces MolmoAct2, a fully open VLA model that combines a vision-language backbone specialized for spatial and embodied reasoning (MolmoER, trained with a "specialize-then-rehearse" recipe), an open action tokenizer (OpenFAST), a flow-matching continuous-action expert grafted onto the discrete-token VLM, and an adaptive-depth reasoning variant (MolmoThink). The release also includes three new datasets, among them 720 hours of teleoperated bimanual trajectories. Across seven simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines such as Pi-05, and MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. The model weights, training code, and training data are openly released.
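As a rough illustration of what a flow-matching action expert does at inference time, the sketch below integrates a velocity field from a noise sample to an action vector with fixed-step Euler updates. It is a toy under stated assumptions, not the MolmoAct2 implementation: `euler_sample` and `make_toy_velocity` are hypothetical names, and the closed-form velocity field stands in for a learned network that would be conditioned on the VLM's features.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries the current point x to x_1 by time 1 is (x_1 - x) / (1 - t).
    A real action expert would predict this quantity from observations.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Usage: start from a fixed "noise" sample and flow to a 3-dimensional action.
noise = [0.3, -0.7, 1.2]
action = euler_sample(make_toy_velocity([0.5, -1.0, 2.0]), noise, num_steps=10)
```

With this idealized field the sampler lands on the target up to floating-point error; a learned field would only approximate it, which is why the number of integration steps becomes a latency and accuracy trade-off.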
📝 Abstract
Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
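The adaptive-depth idea described for MolmoThink, re-predicting depth tokens only for scene regions that change between timesteps, can be sketched as a cache-and-refresh loop. This is a minimal illustration under assumptions that are not from the paper: patches are flattened lists of pixel intensities, change is detected with a mean-absolute-difference threshold, and `predict_depth_token` is a hypothetical callable standing in for the model's depth-token predictor.

```python
from typing import Callable, List, Sequence, Tuple

Patch = Sequence[float]  # flattened pixel intensities of one image patch


def changed_patches(prev: Sequence[Patch], curr: Sequence[Patch],
                    threshold: float = 0.05) -> List[int]:
    """Indices of patches whose mean absolute difference exceeds threshold."""
    changed = []
    for i, (p, c) in enumerate(zip(prev, curr)):
        diff = sum(abs(a - b) for a, b in zip(p, c)) / len(p)
        if diff > threshold:
            changed.append(i)
    return changed


def refresh_depth_tokens(prev_tokens: Sequence[int],
                         prev_patches: Sequence[Patch],
                         curr_patches: Sequence[Patch],
                         predict_depth_token: Callable[[Patch], int],
                         threshold: float = 0.05) -> Tuple[List[int], List[int]]:
    """Reuse cached depth tokens and re-predict only where the scene changed.

    Returns the updated token list and the indices that were re-predicted.
    """
    tokens = list(prev_tokens)
    idx = changed_patches(prev_patches, curr_patches, threshold)
    for i in idx:
        tokens[i] = predict_depth_token(curr_patches[i])
    return tokens, idx


# Usage with a toy predictor (hypothetical): token = rounded 10x mean intensity.
def toy_predictor(patch: Patch) -> int:
    return round(10 * sum(patch) / len(patch))


prev_frame = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
curr_frame = [[0.0, 0.0], [0.9, 0.9], [1.0, 1.0]]  # only the middle patch changed
tokens, refreshed = refresh_depth_tokens([0, 5, 10], prev_frame, curr_frame, toy_predictor)
```

The cost of the expensive predictor then scales with the number of changed patches rather than with the full frame, which is the intuition behind retaining geometric grounding at lower latency.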
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
real-world deployment
action reasoning
embodied reasoning
open-weight models
Innovation

Methods, ideas, or system contributions that make the work stand out.

MolmoAct2
embodied reasoning
open-weight VLA
adaptive-depth reasoning
continuous-action expert
Haoquan Fang
University of Washington, Allen Institute for AI
Computer Vision, Machine Learning, Embodied AI, Robotics
Jiafei Duan
Computer Science PhD Student, University of Washington
Robotics, Robot Learning, Embodied AI, Robotic Manipulation
Donovan Clay
Allen Institute for AI, University of Washington
Sam Wang
Professor of Neuroscience, Princeton University
Neuroscience, Statistical Politics, Two-photon microscopy, Autism, Cerebellum
Shuo Liu
University of Washington, Allen Institute for AI
Robotics, Artificial intelligence
Weikai Huang
Allen Institute for AI, University of Washington
Xiang Fan
Allen Institute for AI, University of Washington
Wei-Chuan Tsai
University of Washington
Shirui Chen
Allen Institute for AI, University of Washington
Yi Ru Wang
University of Washington
Computer Vision, Robotics, Machine Learning
Shanli Xing
University of Washington
Jaemin Cho
PhD Student at UNC Chapel Hill
Multimodal Learning, Natural Language Processing, Machine Learning
Jae Sung Park
Allen Institute for AI
Ainaz Eftekhar
PhD Student, University of Washington
Computer vision, Reinforcement Learning, Embodied AI, Robotics, Machine learning
Peter Sushko
Allen Institute for AI
AI
Karen Farley
Allen Institute for AI
Angad Wadhwa
University of Washington
Cole Harrison
Amazon
Winson Han
Allen Institute for AI
Ying-Chun Lee
University of Washington
Eli VanderBilt
Technical Artist
Rose Hendrix
Research Engineer @ PRIOR, AI2
robotics, machine learning
Suveen Ellawela
Undergraduate at National University of Singapore
HCI, User Experience Design, LLMs, RAG
Lucas Ngoo
Cortex AI
Joyce Chai
University of Michigan