ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation

📅 2026-02-12
🤖 AI Summary
This work proposes ABot-N0, the first vision-language-action (VLA) foundation model capable of unifying five distinct embodied navigation tasks: point-goal, object-goal, instruction-following, interest-point navigation, and person-following. Addressing the limitation of existing approaches that rely on task-specific architectures, ABot-N0 introduces a hierarchical “Brain-Action” framework that integrates large language model–driven cognitive reasoning with flow-matching action experts to generate continuous navigation trajectories. The model is powered by a large-scale data engine comprising 16.9 million expert trajectories. Evaluated across seven benchmarks, ABot-N0 achieves state-of-the-art performance, significantly outperforming specialized methods for each individual task and marking the first successful realization of a unified, general-purpose embodied navigation system.

📝 Abstract
Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 km²). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
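The flow-matching generation mechanism behind the Action Expert can be illustrated with a minimal toy sketch. This is not the paper's model: where ABot-N0 trains a neural velocity field conditioned on perception and language, the sketch below uses a fixed 2-D waypoint `x1` (a hypothetical stand-in for one trajectory point) and the known closed-form field v(x, t) = (x1 - x)/(1 - t) that a perfectly trained network would recover for straight-line probability paths x_t = (1 - t)·x0 + t·x1.

```python
# Toy flow-matching sketch (illustrative only; ABot-N0's Action Expert is
# a learned neural velocity field, not this closed form).
# Training views: a network v_theta(x_t, t) regresses the target velocity
# u = x1 - x0 along straight paths x_t = (1 - t) * x0 + t * x1.
# Generation: integrate dx/dt = v(x, t) from Gaussian noise at t = 0
# to a trajectory waypoint at t = 1.
import random

random.seed(0)
x1 = [1.0, -2.0]  # hypothetical 2-D trajectory waypoint

def make_training_sample(x1):
    """One (x_t, t, u) regression sample a velocity network would see."""
    x0 = [random.gauss(0.0, 1.0) for _ in x1]          # noise sample
    t = random.random()
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u = [b - a for a, b in zip(x0, x1)]                # target velocity
    return x_t, t, u

def generate(x1, steps=10):
    """Euler-integrate dx/dt = (x1 - x) / (1 - t) from noise toward x1."""
    x = [random.gauss(0.0, 1.0) for _ in x1]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = [(b - a) / (1.0 - t) for a, b in zip(x, x1)]
        x = [a + dt * vi for a, vi in zip(x, v)]
    return x

x_t, t, u = make_training_sample(x1)   # what the loss would supervise
waypoint = generate(x1)                # Euler tracking lands on x1 exactly here
```

For this linear path the Euler iterates stay exactly on the interpolant, so `generate` returns `x1` up to floating-point error; a learned field would only approximate this, and in the paper the integration produces continuous navigation trajectories rather than a single point.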
Problem

Research questions and friction points this paper is trying to address.

Embodied Navigation
Task-specific Architectures
Unified Model
Vision-Language-Action
Foundation Model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
Flow Matching
Hierarchical Brain-Action Architecture
Embodied Navigation
Foundation Model
Authors
Zedong Chu, AMAP CV Lab, Alibaba Group
Shichao Xie, AutoNavi, Alibaba Group (computer vision, SLAM, VIO)
Xiaolong Wu, Georgia Institute of Technology (SLAM, localization, robotics)
Yanfen Shen, AMAP CV Lab, Alibaba Group
Minghua Luo, AMAP CV Lab, Alibaba Group
Zhengbo Wang, University of Science and Technology of China (computer vision)
Fei Liu, AMAP CV Lab, Alibaba Group
Xiaoxu Leng, AMAP CV Lab, Alibaba Group
Junjun Hu, AMAP CV Lab, Alibaba Group
Mingyang Yin, AMAP CV Lab, Alibaba Group
Jia Lu, Professor of Journalism and Communication, Tsinghua University (new ICTs and social change)
Yingnan Guo, AMAP CV Lab, Alibaba Group
Kai Yang, Ant Group (artificial intelligence)
Jiawei Han, Abel Bliss Professor of Computer Science, University of Illinois (data mining, database systems, data warehousing, information networks)
Xu Chen, AMAP CV Lab, Alibaba Group
Yanqing Zhu, AMAP CV Lab, Alibaba Group
Yuxiang Zhao, Shanghai Jiao Tong University (text-to-speech, artificial intelligence, deepfake detection)
Xin Liu, AMAP CV Lab, Alibaba Group
Yirong Yang, AMAP CV Lab, Alibaba Group
Ye He, AMAP CV Lab, Alibaba Group
Jiahang Wang, AMAP CV Lab, Alibaba Group
Yang Cai, Professor of Computer Science and Economics, Yale University (theoretical computer science, algorithmic game theory, mechanism design, learning)
Tianlin Zhang, CHN Energy Data Center; The University of Manchester (natural language processing, BioNLP, artificial intelligence, affective computing, mental health)
Li Gao, AMAP CV Lab, Alibaba Group
Liu Liu, AMAP CV Lab, Alibaba Group