MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning

📅 2025-10-03
🤖 AI Summary
Unlike LiDAR point clouds or depth maps, egocentric optical observations resist explicit modeling, so vision-based navigation demands strong models and large-scale training data. Method: We propose MM-Nav, an end-to-end navigation framework grounded in multi-view vision–language–action (VLA) modeling. Its core innovation is to train multiple reinforcement learning (RL) experts with privileged depth information in synthetic environments to generate diverse navigation trajectories; these experts are then distilled into a lightweight student model via a dynamically balanced online multi-expert distillation mechanism. The student integrates a pre-trained large language model and a visual foundation model to form a 360° multi-view VLA architecture. Contribution/Results: Experiments demonstrate significant improvements in success rate, path efficiency, and obstacle-avoidance robustness in both synthetic and real-world settings, along with strong cross-environment generalization.

📝 Abstract
Visual navigation policy is widely regarded as a promising direction, as it mimics humans by using egocentric visual observations for navigation. However, the optical information in visual observations is difficult to model explicitly, unlike LiDAR point clouds or depth maps, which subsequently requires intelligent models and large-scale data. To this end, we propose to leverage the intelligence of the Vision-Language-Action (VLA) model to learn diverse navigation capabilities from synthetic expert data in a teacher-student manner. Specifically, we implement the VLA model, MM-Nav, as a multi-view VLA (with 360° observations) based on pretrained large language models and visual foundation models. For large-scale navigation data, we collect expert data from three reinforcement learning (RL) experts trained with privileged depth information in three challenging tailor-made environments for different navigation capabilities: reaching, squeezing, and avoiding. We iteratively train our VLA model using data collected online from RL experts, where the training ratio is dynamically balanced based on performance on individual capabilities. Through extensive experiments in synthetic environments, we demonstrate that our model achieves strong generalization capability. Moreover, we find that our student VLA model outperforms the RL teachers, demonstrating the synergistic effect of integrating multiple capabilities. Extensive real-world experiments further confirm the effectiveness of our method.
Problem

Research questions and friction points this paper is trying to address.

Developing robust visual navigation using multi-view vision-language-action models
Learning diverse navigation skills from synthetic expert demonstration data
Achieving generalization across reaching, squeezing, and avoiding scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view VLA model with 360-degree observations
Multi-expert learning from three RL specialists
Dynamic training ratio balancing based on performance
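The dynamic training-ratio balancing described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the weighting scheme (allocating more distillation data to capabilities where the student's success rate is lower), the function names, and the `temperature` parameter are all assumptions.

```python
import random

def balance_expert_ratios(success_rates, temperature=1.0):
    """Hypothetical sketch: weight each RL expert's share of the next
    training batch by the student's remaining performance gap on that
    capability, so weaker skills receive more distillation data."""
    # Gap-to-perfect success rate, sharpened by an assumed temperature,
    # then normalized into a sampling distribution over experts.
    gaps = {k: max(1e-6, (1.0 - v) ** temperature) for k, v in success_rates.items()}
    total = sum(gaps.values())
    return {k: g / total for k, g in gaps.items()}

def sample_expert(ratios, rng=random):
    """Draw one expert to collect the next online rollout from."""
    experts, weights = zip(*ratios.items())
    return rng.choices(experts, weights=weights, k=1)[0]

# Example: the student is strong at reaching but weak at squeezing,
# so squeezing data dominates the next training round.
ratios = balance_expert_ratios({"reaching": 0.9, "squeezing": 0.5, "avoiding": 0.7})
```

Under this scheme the ratios shift every iteration as the student is re-evaluated on each capability, which matches the paper's description of iterative online data collection.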
Authors

Tianyu Xu — Peking University
Jiawei Chen — Peking University
Jiazhao Zhang — Peking University (Embodied AI, Navigation, 3D Vision)
Wenyao Zhang — PhD Student, Shanghai Jiao Tong University (Robot Learning, Representation Learning)
Zekun Qi — Tsinghua University (Robotics, 3D Computer Vision, Vision Language Models)
Minghan Li — Galbot
Zhizheng Zhang — Galbot, BAAI
He Wang — Peking University, Galbot, BAAI