🤖 AI Summary
This work proposes a lightweight vision-language-action (VLA) framework to bridge the gap between foundation models and reliable robotic deployment, and is the first to explicitly integrate robot embodiment priors (such as multi-view camera parameters and URDF-based kinematics) into the VLA architecture. Leveraging a two-stage "pre-training + post-training" paradigm, the model matches significantly larger counterparts despite having only 0.2 billion parameters. It attains state-of-the-art results on simulation benchmarks including RoboTwin 2.0, LIBERO, and GenieSim, while demonstrating strong 3D spatial reasoning, cross-morphology adaptability, and low-latency edge deployment on real-world long-horizon tasks. The complete toolchain is publicly released.
📝 Abstract
In this work, we introduce HoloBrain-0, a comprehensive Vision-Language-Action (VLA) framework that bridges the gap between foundation model research and reliable real-world robot deployment. The core of our system is a novel VLA architecture that explicitly incorporates robot embodiment priors, including multi-view camera parameters and kinematic descriptions (URDF), to enhance 3D spatial reasoning and support diverse embodiments. We validate this design through a scalable "pre-train then post-train" paradigm, achieving state-of-the-art results on simulation benchmarks such as RoboTwin 2.0, LIBERO, and GenieSim, as well as strong results on challenging long-horizon real-world manipulation tasks. Notably, our efficient 0.2B-parameter variant rivals significantly larger baselines, enabling low-latency on-device deployment. To further accelerate research and practical adoption, we fully open-source the entire HoloBrain ecosystem, which includes: (1) powerful pre-trained VLA foundations; (2) post-trained checkpoints for multiple simulation suites and real-world tasks; and (3) RoboOrchard, a full-stack VLA infrastructure for data curation, model training, and deployment. Together with standardized data collection protocols, this release provides the community with a complete, reproducible path toward high-performance robotic manipulation.
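The abstract does not specify how embodiment priors are fed to the model; one common pattern is to flatten each camera's parameters and each joint's limits into fixed-width conditioning tokens that the policy backbone can attend to. The sketch below illustrates that idea only; the function name, feature layout, and token width are illustrative assumptions, not HoloBrain-0's actual interface.

```python
import numpy as np


def embodiment_prior_tokens(intrinsics, extrinsics, joint_limits, dim=8):
    """Hypothetical encoder: turn per-camera parameters and per-joint
    limits (e.g. parsed from a URDF) into one fixed-width token each.

    intrinsics:   list of 3x3 camera matrices K
    extrinsics:   list of 4x4 camera-to-base transforms T
    joint_limits: list of (lower, upper) bounds per joint
    Returns an array of shape (num_cameras + num_joints, dim).
    """
    tokens = []
    for K, T in zip(intrinsics, extrinsics):
        # fx, fy, cx, cy from the intrinsic matrix, plus camera position.
        feat = np.concatenate([K[[0, 1, 0, 1], [0, 1, 2, 2]], T[:3, 3]])
        tokens.append(np.pad(feat, (0, dim - feat.size)))
    for lo, hi in joint_limits:
        # Joint bounds and range as a small per-joint feature vector.
        feat = np.array([lo, hi, hi - lo], dtype=float)
        tokens.append(np.pad(feat, (0, dim - feat.size)))
    return np.stack(tokens)


# Example: two identical cameras and a 2-DoF arm -> 4 tokens of width 8.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
tokens = embodiment_prior_tokens([K, K], [T, T], [(-1.0, 1.0), (-2.0, 2.0)])
print(tokens.shape)  # (4, 8)
```

In a real VLA stack these tokens would typically be projected by a small learned MLP and concatenated with vision-language tokens; here they are left as raw padded features to keep the sketch self-contained.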