MolmoAct: Action Reasoning Models that can Reason in Space

📅 2025-08-11
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Most robotic foundation models map perception and instructions directly to control, which limits interpretability, generalization, and semantic grounding. To address this, the paper proposes Action Reasoning Models (ARMs), a three-stage architecture (perception → planning → control) whose middle stage produces editable trajectory traces that serve as spatial plans, enabling semantic alignment and steerable behavior. The concrete model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level trajectory traces, and predicts low-level actions. MolmoAct-7B-D reaches 70.5% zero-shot accuracy on SimplerEnv Visual Matching and an 86.6% average success rate on LIBERO, outperforming ThinkAct by 6.3% on long-horizon tasks; after real-world fine-tuning it improves task progression over Pi-0-FAST by 10% (single-arm) and 22.7% (bimanual), exceeds baselines by 23.3% on out-of-distribution generalization, and receives the highest human-preference scores for open-ended instruction following and trajectory steering. The authors also release the MolmoAct Dataset (over 10,000 robot trajectories) together with all model weights and training code.
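The three-stage pipeline summarized above can be pictured with the minimal Python sketch below. The class and method names (ActionReasoningModel, perceive, plan, act) are hypothetical illustrations of the perception → planning → control flow under the assumptions stated in the comments, not the actual MolmoAct API.

```python
# Minimal sketch of the perception -> planning -> control flow described above.
# All names here are hypothetical; this is NOT the MolmoAct implementation.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class PerceptionTokens:
    """Depth-aware tokens encoding the observation and the instruction."""
    tokens: np.ndarray  # shape: (num_tokens, hidden_dim)


@dataclass
class TrajectoryTrace:
    """Mid-level spatial plan: an editable sequence of 2D waypoints in image space."""
    waypoints: List[Tuple[float, float]]


class ActionReasoningModel:
    """Skeleton of an ARM; each stage would be backed by the underlying VLA network."""

    def perceive(self, rgb: np.ndarray, depth: np.ndarray, instruction: str) -> PerceptionTokens:
        """Stage 1: encode image, depth, and instruction into perception tokens."""
        raise NotImplementedError

    def plan(self, tokens: PerceptionTokens) -> TrajectoryTrace:
        """Stage 2: generate the mid-level trajectory trace (the editable spatial plan)."""
        raise NotImplementedError

    def act(self, tokens: PerceptionTokens, trace: TrajectoryTrace) -> np.ndarray:
        """Stage 3: decode low-level actions (e.g., end-effector deltas) from the plan."""
        raise NotImplementedError

    def step(self, rgb: np.ndarray, depth: np.ndarray, instruction: str) -> np.ndarray:
        """Run all three stages; the trace is exposed so it can be inspected or edited."""
        tokens = self.perceive(rgb, depth, instruction)
        trace = self.plan(tokens)
        return self.act(tokens, trace)
```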

📝 Abstract
Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of vision-language-action models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset -- a mid-training robot dataset comprising over 10,000 high quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blogpost: https://allenai.org/blog/molmoact
Problem

Research questions and friction points this paper is trying to address.

Robotic foundation models that map perception and instructions directly to control lack adaptability and semantic grounding.
Need for vision-language-action models that integrate perception, planning, and control in a structured pipeline.
Current models struggle to generalize and to produce explainable, steerable behavior.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Depth-aware perception tokens enhance spatial understanding
Editable trajectory traces for mid-level spatial planning (see the sketch after this list)
Precise low-level action prediction improves task performance
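
To make the "editable" property of the trajectory trace concrete, the hedged sketch below reuses the hypothetical ActionReasoningModel interface from the earlier sketch and shows how an edit to the mid-level plan could steer the low-level actions: the trace is adjusted between stage 2 and stage 3. The shift_trace helper and steer_and_act function are illustrative assumptions, not part of the released code.

```python
# Illustrative only: steering via an edited mid-level trajectory trace,
# built on the hypothetical ActionReasoningModel interface sketched earlier.
from typing import List, Tuple

import numpy as np

Waypoint = Tuple[float, float]  # (x, y) in image coordinates


def shift_trace(waypoints: List[Waypoint], dx: float, dy: float) -> List[Waypoint]:
    """A simple edit: translate every waypoint of the spatial plan."""
    return [(x + dx, y + dy) for x, y in waypoints]


def steer_and_act(model, rgb: np.ndarray, depth: np.ndarray,
                  instruction: str, dx: float = 0.0, dy: float = 0.0) -> np.ndarray:
    """Predict a trace, apply a user edit, then decode actions from the edited plan."""
    tokens = model.perceive(rgb, depth, instruction)         # stage 1: perception tokens
    trace = model.plan(tokens)                               # stage 2: mid-level trace
    trace.waypoints = shift_trace(trace.waypoints, dx, dy)   # human / heuristic steering
    return model.act(tokens, trace)                          # stage 3: low-level actions
```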