🤖 AI Summary
This work proposes ABot-N0, a vision-language-action (VLA) foundation model that unifies five embodied navigation tasks: point-goal, object-goal, instruction-following, point-of-interest (POI) goal, and person-following navigation. To address the reliance of existing approaches on task-specific architectures, ABot-N0 introduces a hierarchical “Brain-Action” framework that pairs a large language model–based cognitive module for semantic reasoning with a flow-matching action expert that generates continuous navigation trajectories. Training is supported by a large-scale data engine comprising 16.9 million expert trajectories and 5.0 million reasoning samples. Evaluated on seven benchmarks, ABot-N0 achieves state-of-the-art performance and significantly outperforms specialized models, which the authors present as a unified, general-purpose embodied navigation system.
📝 Abstract
Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a “Grand Unification” across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical “Brain-Action” architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 $\text{km}^2$). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.