🤖 AI Summary
This work proposes a lightweight vision-language-action (VLA) framework to bridge the gap between foundation models and reliable robotic deployment, and is the first to explicitly integrate robot embodiment priors (such as multi-view camera parameters and URDF-based kinematics) into the VLA architecture. Leveraging a two-stage "pre-training + post-training" paradigm, the model matches significantly larger counterparts despite having only 0.2 billion parameters. It attains state-of-the-art results on simulation benchmarks including RoboTwin 2.0, LIBERO, and GenieSim, while demonstrating strong 3D spatial reasoning, cross-morphology adaptability, and low-latency edge deployment on real-world long-horizon tasks. The complete toolchain is publicly released.
📝 Abstract
In this work, we introduce HoloBrain-0, a comprehensive Vision-Language-Action (VLA) framework that bridges the gap between foundation model research and reliable real-world robot deployment. The core of our system is a novel VLA architecture that explicitly incorporates robot embodiment priors, including multi-view camera parameters and kinematic descriptions (URDF), to enhance 3D spatial reasoning and support diverse embodiments. We validate this design through a scalable "pre-train then post-train" paradigm, achieving state-of-the-art results on simulation benchmarks such as RoboTwin 2.0, LIBERO, and GenieSim, as well as strong results on challenging long-horizon real-world manipulation tasks. Notably, our efficient 0.2B-parameter variant rivals significantly larger baselines, enabling low-latency on-device deployment. To further accelerate research and practical adoption, we fully open-source the entire HoloBrain ecosystem, which includes: (1) powerful pre-trained VLA foundations; (2) post-trained checkpoints for multiple simulation suites and real-world tasks; and (3) RoboOrchard, a full-stack VLA infrastructure for data curation, model training, and deployment. Together with standardized data collection protocols, this release provides the community with a complete, reproducible path toward high-performance robotic manipulation.
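The abstract does not specify how embodiment priors are fed to the model; one common pattern is to flatten each camera's parameters and each joint's limits into fixed-width conditioning tokens that the policy backbone can attend to. The sketch below illustrates that idea only; the function name, feature layout, and token width are illustrative assumptions, not HoloBrain-0's actual interface.

```python
import numpy as np


def embodiment_prior_tokens(intrinsics, extrinsics, joint_limits, dim=8):
    """Hypothetical encoder: turn per-camera parameters and per-joint
    limits (e.g. parsed from a URDF) into one fixed-width token each.

    intrinsics:   list of 3x3 camera matrices K
    extrinsics:   list of 4x4 camera-to-base transforms T
    joint_limits: list of (lower, upper) bounds per joint
    Returns an array of shape (num_cameras + num_joints, dim).
    """
    tokens = []
    for K, T in zip(intrinsics, extrinsics):
        # fx, fy, cx, cy from the intrinsic matrix, plus camera position.
        feat = np.concatenate([K[[0, 1, 0, 1], [0, 1, 2, 2]], T[:3, 3]])
        tokens.append(np.pad(feat, (0, dim - feat.size)))
    for lo, hi in joint_limits:
        # Joint bounds and range as a small per-joint feature vector.
        feat = np.array([lo, hi, hi - lo], dtype=float)
        tokens.append(np.pad(feat, (0, dim - feat.size)))
    return np.stack(tokens)


# Example: two identical cameras and a 2-DoF arm -> 4 tokens of width 8.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
tokens = embodiment_prior_tokens([K, K], [T, T], [(-1.0, 1.0), (-2.0, 2.0)])
print(tokens.shape)  # (4, 8)
```

In a real VLA stack these tokens would typically be projected by a small learned MLP and concatenated with vision-language tokens; here they are left as raw padded features to keep the sketch self-contained.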