ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fragmented data, inconsistent representations, and misaligned training objectives that hinder embodied agents operating across diverse hardware platforms. To this end, the authors propose Action Manifold Learning (AML), which leverages the action manifold hypothesis to project high-dimensional actions into a low-dimensional, continuous space constrained by physical priors, thereby enhancing policy stability and decoding efficiency. They introduce UniACT, a standardized dataset comprising over 6 million trajectories and 9,500 hours of interaction data, and develop a dual-stream perception architecture that integrates semantic knowledge from vision-language models with 3D geometric priors, enabling end-to-end policy learning via a DiT backbone. Experiments demonstrate that each component is effective on its own and that, combined, the components yield additive performance gains, markedly improving cross-platform and cross-task generalization across six public benchmarks.
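
To make the dual-stream perception idea concrete, below is a minimal sketch of how semantic tokens from a VLM and geometric tokens from a plug-and-play 3D module (e.g., VGGT) could be fused via cross-attention to condition a policy backbone without modifying either encoder. All module names, dimensions, and the cross-attention design here are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a dual-stream fusion module; not the authors' code.
import torch
import torch.nn as nn

class DualStreamFusion(nn.Module):
    """Fuses VLM semantic tokens with 3D geometric tokens via cross-attention."""
    def __init__(self, sem_dim=1024, geo_dim=384, width=512, heads=8):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, width)   # semantic stream (VLM features)
        self.geo_proj = nn.Linear(geo_dim, width)   # geometric stream (3D features)
        self.cross = nn.MultiheadAttention(width, heads, batch_first=True)

    def forward(self, sem_tokens, geo_tokens):
        # sem_tokens: (B, Ns, sem_dim); geo_tokens: (B, Ng, geo_dim)
        q = self.sem_proj(sem_tokens)
        kv = self.geo_proj(geo_tokens)
        fused, _ = self.cross(q, kv, kv)            # semantics attend to geometry
        return fused                                # (B, Ns, width) conditioning

# Toy usage with random tensors standing in for real encoder outputs
fusion = DualStreamFusion()
sem = torch.randn(2, 32, 1024)   # e.g., VLM token features
geo = torch.randn(2, 64, 384)    # e.g., multi-view 3D token features
cond = fusion(sem, geo)          # (2, 32, 512)
```

Because the fusion happens on encoder outputs, either stream could in principle be swapped for another plug-and-play module, which matches the paper's claim of modular perception without backbone changes.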

📝 Abstract
Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.
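
The abstract's key shift, from denoising to direct projection onto a feasible action manifold, can be illustrated with a hedged sketch: instead of training a DiT to predict the noise added to an action sequence, the network regresses the clean, continuous action sequence in a single pass. `DiTBackbone`, the dimensions, the perturbation scheme, and the MSE loss below are hypothetical placeholders under that reading of AML, not the paper's implementation.

```python
# Hypothetical sketch of direct clean-action regression (AML-style objective).
import torch
import torch.nn as nn

class DiTBackbone(nn.Module):
    """Stand-in for a Diffusion-Transformer-style policy over action sequences."""
    def __init__(self, obs_dim=512, act_dim=7, width=256, heads=4, depth=4):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, width)
        self.act_proj = nn.Linear(act_dim, width)
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(width, act_dim)

    def forward(self, obs, actions):
        # obs: (B, obs_dim); actions: (B, horizon, act_dim)
        tokens = self.act_proj(actions) + self.obs_proj(obs).unsqueeze(1)
        return self.head(self.blocks(tokens))       # (B, horizon, act_dim)

def aml_training_step(model, obs, clean_actions):
    """Perturb the demonstrated action sequence, then regress the clean sequence
    directly: the network learns to project perturbed samples back onto the
    feasible action manifold, so decoding needs one forward pass rather than
    an iterative denoising chain."""
    perturbed = clean_actions + torch.randn_like(clean_actions)
    pred = model(obs, perturbed)
    return ((pred - clean_actions) ** 2).mean()     # simple MSE projection loss

# Toy usage with random tensors standing in for real observations/actions
model = DiTBackbone()
loss = aml_training_step(model, torch.randn(2, 512), torch.randn(2, 16, 7))
loss.backward()
```

The single-pass decoding in this sketch is what would account for the decoding-speed improvement the abstract attributes to AML relative to multi-step diffusion sampling.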
Problem

Research questions and friction points this paper is trying to address.

embodied agents
robotic manipulation
general-purpose intelligence
action prediction
cross-platform generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action Manifold Learning
VLA Foundation Model
Unified Robotic Dataset
DiT-based Action Prediction
Modular 3D Perception