MolmoBot: Large-Scale Simulation Enables Zero-Shot Manipulation

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sim-to-real transfer is widely assumed to require at least some real-world data collection or task-specific fine-tuning; this work challenges that assumption with a purely simulation-driven, zero-shot approach. Using 1.8 million procedurally generated expert trajectories, the authors train three policy classes: MolmoBot, a Molmo2-based vision-language model with a flow-matching action head; MolmoBot-Pi0, a π₀ replication for direct comparison; and MolmoBot-SPOC, a lightweight policy suited to edge deployment. Evaluated on the Franka FR3 and Rainbow Robotics RB-Y1 platforms without any real-world fine-tuning, MolmoBot attains a 79.2% success rate on real-world tabletop pick-and-place with previously unseen objects and environments (versus 39.2% for π₀.₅), demonstrating that large-scale, diverse simulation alone can enable zero-shot transfer to real-world static and mobile manipulation tasks.
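The scale claim rests on procedural generation: rather than hand-collecting demonstrations, an expert with privileged simulator state solves sampled scene/task combinations and records the result. Below is a minimal sketch of such a loop, assuming hypothetical scene/task pools and a stubbed planner; the actual MolmoBot-Engine APIs and MolmoSpaces assets are not shown here.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    scene: str
    task: str
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)

# Hypothetical pools; the real pipeline samples MolmoSpaces environments
# and diverse articulated/pickable assets.
SCENES = [f"house_{i:04d}" for i in range(1000)]
TASKS = ["pick_and_place", "open_door", "open_drawer", "open_cabinet"]

def plan_expert_actions(scene: str, task: str, rng: random.Random) -> list:
    """Stub for a privileged expert (e.g. a motion planner that reads exact
    object poses from the simulator). Returns a dummy action sequence."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(7)] for _ in range(50)]

def generate_dataset(num_episodes: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    episodes = []
    for _ in range(num_episodes):
        scene, task = rng.choice(SCENES), rng.choice(TASKS)
        actions = plan_expert_actions(scene, task, rng)
        # A real pipeline would verify task success in simulation before
        # keeping an episode, discarding failed expert rollouts.
        episodes.append(Trajectory(scene, task, actions=actions))
    return episodes

if __name__ == "__main__":
    data = generate_dataset(10)
    print(f"generated {len(data)} episodes, e.g. {data[0].task} in {data[0].scene}")
```

Randomizing over scenes, assets, and tasks at this stage is what produces the diversity the zero-shot transfer result attributes the generalization to.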

📝 Abstract
A prevailing view in robot learning is that simulation alone is not enough; effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse simulated synthetic training data, we show that zero-shot transfer to the real world is not only possible, but effective for both static and mobile manipulation. We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.8 million expert trajectories for articulated object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot, a Molmo2-based multi-frame vision-language model with a flow-matching action head; MolmoBot-Pi0, which replicates the $π_0$ architecture to enable direct comparison; and MolmoBot-SPOC, a lightweight policy suitable for edge deployment and amenable to RL fine-tuning. We evaluate on two robotic platforms: the Franka FR3 for tabletop manipulation tasks and the Rainbow Robotics RB-Y1 mobile manipulator for door opening, drawer manipulation, cabinet interaction, and mobile pick-and-place. Without any real-world fine-tuning, our policies achieve zero-shot transfer to unseen objects and environments. On tabletop pick-and-place, MolmoBot achieves a success rate of 79.2% in real-world evaluations across 4 settings, outperforming $π_{0.5}$ at 39.2%. Our results demonstrate that procedural environment generation combined with diverse articulated assets can produce robust manipulation policies that generalize broadly to the real world. Technical Blog: https://allenai.org/blog/molmobot-robot-manipulation
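For readers unfamiliar with flow-matching action heads: the idea is to regress a velocity field that transports Gaussian noise into an expert action chunk, conditioned on the vision-language model's features. A minimal PyTorch sketch under assumed shapes follows; the layer sizes, horizon, and conditioning interface are illustrative, not MolmoBot's actual architecture.

```python
import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    """Hypothetical sketch: predicts the velocity field over a chunk of
    future actions, conditioned on pooled VLM features and flow time t."""

    def __init__(self, feat_dim=2048, action_dim=7, horizon=16, hidden=1024):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(feat_dim + horizon * action_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, feats, noisy_actions, t):
        # feats: (B, feat_dim); noisy_actions: (B, horizon, action_dim); t: (B,)
        x = torch.cat([feats, noisy_actions.flatten(1), t[:, None]], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

def flow_matching_loss(head, feats, expert_actions):
    """Conditional flow matching with a straight-line (rectified) path:
    a_t = (1 - t) * a0 + t * a1, so the target velocity is a1 - a0."""
    a1 = expert_actions                       # expert action chunk
    a0 = torch.randn_like(a1)                 # noise endpoint
    t = torch.rand(a1.shape[0], device=a1.device)
    a_t = (1 - t[:, None, None]) * a0 + t[:, None, None] * a1
    v_pred = head(feats, a_t, t)
    return ((v_pred - (a1 - a0)) ** 2).mean()

# Smoke test with random tensors standing in for VLM features and demos.
head = FlowMatchingActionHead()
loss = flow_matching_loss(head, torch.randn(4, 2048), torch.randn(4, 16, 7))
loss.backward()
```

At inference, sampling an action chunk amounts to integrating the learned velocity field from noise at t = 0 to t = 1, e.g. with a few Euler steps.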
Problem

Research questions and friction points this paper is trying to address.

sim-to-real transfer
zero-shot manipulation
robot learning
simulation
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot transfer
large-scale simulation
procedural data generation
vision-language policy
sim-to-real
👥 Authors

Abhay Deshpande · Allen Institute for Artificial Intelligence · Robotics, Machine Learning
Maya Guru · Allen Institute for AI
Rose Hendrix · Research Engineer @ PRIOR, AI2 · Robotics, Machine Learning
Snehal Jauhri · Technische Universität Darmstadt · Robotics, Machine Learning, Computer Vision
Ainaz Eftekhar · PhD Student, University of Washington · Computer Vision, Reinforcement Learning, Embodied AI, Robotics, Machine Learning
Rohun Tripathi · Allen Institute for AI
Max Argus · University of Freiburg · Computer Vision, Machine Learning, Robotics
Jordi Salvador · Allen Institute for AI · Computer Vision, Machine Learning, Embodied AI
Haoquan Fang · University of Washington, Allen Institute for AI · Computer Vision, Machine Learning, Embodied AI, Robotics
Matthew Wallingford · University of Washington · Machine Learning, Computer Vision
Wilbert Pumacay · Allen Institute for AI
Yejin Kim · Ai2
Quinn Pfeifer · University of Washington
Ying-Chun Lee · University of Washington
Piper Wolters · Research Engineer, Allen Institute for AI · Computer Vision, Deep Learning
Omar Rayyan · UCLA · Robotics, Machine Learning
Mingtong Zhang · University of Southern California · Computer Vision, Robotics, Robot Learning
Jiafei Duan · Computer Science PhD Student, University of Washington · Robotics, Robot Learning, Embodied AI, Robotic Manipulation
Karen Farley · Allen Institute for AI
Winson Han · Allen Institute for AI
Eli Vanderbilt · Allen Institute for AI
Dieter Fox · University of Washington and AI2 · Robotics, Artificial Intelligence, Computer Vision
Ali Farhadi · Professor, Computer Science and Engineering, University of Washington · Computer Vision, Machine Learning, Artificial Intelligence
Georgia Chalvatzaki · Professor for Interactive Robot Perception and Learning, Technische Universität Darmstadt · Robotics, Machine Learning, Reinforcement Learning, Robot Perception, HRI
Dhruv Shah · Princeton University, Google DeepMind · Robot Learning, Artificial Intelligence, Robotics, Reinforcement Learning