DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI

📅 2026-02-16
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing internet-scale pretrained models, which lack native support for embodied tasks and struggle to integrate semantic understanding with physical interaction. To bridge this gap, the authors propose DM0, the first natively embodied unified vision-language-action (VLA) framework. DM0 employs a three-stage training pipeline that fuses heterogeneous multimodal data to jointly learn high-level semantics and low-level physical priors. Key innovations include Embodied Spatial Scaffolding to construct spatial chains of thought, flow-matching action experts for precise motor control, and a mixed-gradient training strategy that preserves general-purpose representations. Evaluated on Table30 of the RoboChallenge benchmark, DM0 achieves state-of-the-art performance under both the Specialist and Generalist evaluation settings.
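As a concrete illustration of the mixed-gradient idea, the sketch below shows one way such a training step could be wired up in PyTorch. It is a minimal toy reconstruction based only on the summary above; the module names (ToyVLM, ToyActionExpert) and batch layout are assumptions for illustration, not DM0's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLM(nn.Module):
    """Stand-in for the pretrained vision-language backbone."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)
        self.lm_head = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

class ToyActionExpert(nn.Module):
    """Stand-in for the action head that consumes VLM features."""
    def __init__(self, dim: int = 64, action_dim: int = 7):
        super().__init__()
        self.head = nn.Linear(dim, action_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)

def mixed_gradient_step(vlm, expert, batch, embodied: bool) -> torch.Tensor:
    feats = vlm(batch["obs"])
    if embodied:
        # Embodied data: detach the features so the action loss updates
        # only the action expert, never the VLM. This is the rule that
        # keeps the VLM's general-purpose representations intact.
        pred = expert(feats.detach())
        return F.mse_loss(pred, batch["actions"])
    # Non-embodied data (e.g. web text, driving logs): the loss flows
    # through the VLM as usual, so it keeps training.
    return F.mse_loss(vlm.lm_head(feats), batch["targets"])
```

In an actual run, embodied and non-embodied batches would be interleaved so both losses contribute across optimizer steps; here the flag simply selects which gradient path is active.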

📝 Abstract
Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the outset. Our methodology follows a comprehensive three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First, we conduct large-scale unified pretraining of the Vision-Language Model (VLM) on diverse corpora (seamlessly integrating web text, autonomous driving scenarios, and embodied interaction logs) to jointly acquire semantic knowledge and physical priors. Subsequently, we build a flow-matching action expert atop the VLM. To reconcile high-level reasoning with low-level control, DM0 employs a hybrid training strategy: for embodied data, gradients from the action expert are not backpropagated to the VLM, preserving its generalized representations, while the VLM remains trainable on non-embodied data. Furthermore, we introduce an Embodied Spatial Scaffolding strategy to construct spatial Chain-of-Thought (CoT) reasoning, effectively constraining the action solution space. Experiments on the RoboChallenge benchmark demonstrate that DM0 achieves state-of-the-art performance in both the Specialist and Generalist settings on Table30.
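The flow-matching action expert mentioned in the abstract admits a compact illustration. The sketch below assumes the standard conditional flow-matching setup, with a straight-line interpolation path between Gaussian noise and the ground-truth action chunk; it is a hypothetical reconstruction, since the paper's abstract does not specify the expert's architecture at this level of detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowActionExpert(nn.Module):
    """Tiny velocity network v_theta(a_t, t | context)."""
    def __init__(self, ctx_dim: int = 64, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ctx_dim + action_dim + 1, 128),
            nn.SiLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, ctx, a_t, t):
        return self.net(torch.cat([ctx, a_t, t], dim=-1))

def flow_matching_loss(expert, ctx, actions):
    # Sample a point on the straight-line path noise -> action; the
    # regression target is the constant velocity (action - noise).
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    a_t = t * actions + (1.0 - t) * noise
    return F.mse_loss(expert(ctx, a_t, t), actions - noise)

@torch.no_grad()
def sample_actions(expert, ctx, action_dim=7, steps=10):
    # Euler integration of the learned flow: noise -> action chunk.
    a = torch.randn(ctx.shape[0], action_dim)
    for i in range(steps):
        t = torch.full((ctx.shape[0], 1), i / steps)
        a = a + expert(ctx, a, t) / steps
    return a
```

At inference time, sample_actions integrates the learned velocity field with a handful of Euler steps; flow matching is typically favored for action heads precisely because it supports this kind of few-step deterministic sampling.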
Problem

Research questions and friction points this paper is trying to address.

Embodied AI
Vision-Language-Action Model
Physical AI
Embodied Grounding
Robotic Manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied-Native
Vision-Language-Action Model
Spatial Chain-of-Thought
Hybrid Training Strategy
Physical AI
Authors: En Yu, Haoran Lv, Jianjian Sun, Kangheng Lin, Ruitao Zhang, Yukang Shi, Yuyang Chen, Ze Chen, Ziheng Zhang, Fan Jia, Kaixin Liu, Meng Zhang, Ruitao Hao, Saike Huang, Songhan Xie, Yu Liu, Zhao Wu, Bin Xie, Pengwei Zhang, Qi Yang, Xianchi Deng, Yunfei Wei, Enwen Zhang, Hongyang Peng, Jie Zhao