DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI

📅 2026-02-16
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing internet-scale pretrained models, which lack native support for embodied tasks and struggle to integrate semantic understanding with physical interaction. To bridge this gap, the authors propose DM0, the first natively embodied unified vision-language-action (VLA) framework. DM0 employs a three-stage training pipeline that fuses heterogeneous multimodal data to jointly learn high-level semantics and low-level physical priors. Key innovations include Embodied Spatial Scaffolding to construct spatial chains of thought, flow-matching action experts for precise motor control, and a mixed-gradient training strategy that preserves general-purpose representations. Evaluated on Table30 of the RoboChallenge benchmark, DM0 achieves state-of-the-art performance under both the Specialist and Generalist evaluation settings.
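As a concrete illustration of the mixed-gradient idea, the sketch below shows one way such a training step could be wired up in PyTorch. It is a minimal toy reconstruction based only on the summary above; the module names (ToyVLM, ToyActionExpert) and batch layout are assumptions for illustration, not DM0's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLM(nn.Module):
    """Stand-in for the pretrained vision-language backbone."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)
        self.lm_head = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

class ToyActionExpert(nn.Module):
    """Stand-in for the action head that consumes VLM features."""
    def __init__(self, dim: int = 64, action_dim: int = 7):
        super().__init__()
        self.head = nn.Linear(dim, action_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)

def mixed_gradient_step(vlm, expert, batch, embodied: bool) -> torch.Tensor:
    feats = vlm(batch["obs"])
    if embodied:
        # Embodied data: detach the features so the action loss updates
        # only the action expert, never the VLM. This is the rule that
        # keeps the VLM's general-purpose representations intact.
        pred = expert(feats.detach())
        return F.mse_loss(pred, batch["actions"])
    # Non-embodied data (e.g. web text, driving logs): the loss flows
    # through the VLM as usual, so it keeps training.
    return F.mse_loss(vlm.lm_head(feats), batch["targets"])
```

In an actual run, embodied and non-embodied batches would be interleaved so both losses contribute across optimizer steps; here the flag simply selects which gradient path is active.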

📝 Abstract
Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the outset. Our methodology follows a comprehensive three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First, we conduct large-scale unified pretraining of the Vision-Language Model (VLM) on diverse corpora (seamlessly integrating web text, autonomous driving scenarios, and embodied interaction logs) to jointly acquire semantic knowledge and physical priors. Subsequently, we build a flow-matching action expert atop the VLM. To reconcile high-level reasoning with low-level control, DM0 employs a hybrid training strategy: for embodied data, gradients from the action expert are not backpropagated to the VLM, preserving its generalized representations, while the VLM remains trainable on non-embodied data. Furthermore, we introduce an Embodied Spatial Scaffolding strategy to construct spatial Chain-of-Thought (CoT) reasoning, effectively constraining the action solution space. Experiments on the RoboChallenge benchmark demonstrate that DM0 achieves state-of-the-art performance in both the Specialist and Generalist settings on Table30.
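The flow-matching action expert mentioned in the abstract admits a compact illustration. The sketch below assumes the standard conditional flow-matching setup, with a straight-line interpolation path between Gaussian noise and the ground-truth action chunk; it is a hypothetical reconstruction, since the paper's abstract does not specify the expert's architecture at this level of detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowActionExpert(nn.Module):
    """Tiny velocity network v_theta(a_t, t | context)."""
    def __init__(self, ctx_dim: int = 64, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ctx_dim + action_dim + 1, 128),
            nn.SiLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, ctx, a_t, t):
        return self.net(torch.cat([ctx, a_t, t], dim=-1))

def flow_matching_loss(expert, ctx, actions):
    # Sample a point on the straight-line path noise -> action; the
    # regression target is the constant velocity (action - noise).
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    a_t = t * actions + (1.0 - t) * noise
    return F.mse_loss(expert(ctx, a_t, t), actions - noise)

@torch.no_grad()
def sample_actions(expert, ctx, action_dim=7, steps=10):
    # Euler integration of the learned flow: noise -> action chunk.
    a = torch.randn(ctx.shape[0], action_dim)
    for i in range(steps):
        t = torch.full((ctx.shape[0], 1), i / steps)
        a = a + expert(ctx, a, t) / steps
    return a
```

At inference time, sample_actions integrates the learned velocity field with a handful of Euler steps; flow matching is typically favored for action heads precisely because it supports this kind of few-step deterministic sampling.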
Problem

Research questions and friction points this paper is trying to address.

Embodied AI
Vision-Language-Action Model
Physical AI
Embodied Grounding
Robotic Manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied-Native
Vision-Language-Action Model
Spatial Chain-of-Thought
Hybrid Training Strategy
Physical AI
Authors: En Yu, Haoran Lv, Jianjian Sun, Kangheng Lin, Ruitao Zhang, Yukang Shi, Yuyang Chen, Ze Chen, Ziheng Zhang, Fan Jia, Kaixin Liu, Meng Zhang, Ruitao Hao, Saike Huang, Songhan Xie, Yu Liu, Zhao Wu, Bin Xie, Pengwei Zhang, Qi Yang, Xianchi Deng, Yunfei Wei, Enwen Zhang, Hongyang Peng, Jie Zhao