🤖 AI Summary
Current navigation foundation models are trained on offline data and therefore lack causal reasoning about action consequences and counterfactual adaptation, so they fail to meet real-world urban requirements such as obstacle avoidance and yielding to pedestrians. To address this, we propose the Seeing-to-Experiencing (S2E) framework, which integrates large-scale video pretraining with reinforcement learning (RL) post-training to shift from passive perception to active interaction. We introduce an anchor-guided distribution matching strategy and a residual-attention module to stabilize optimization and enhance responsiveness while preserving pretrained knowledge. Furthermore, we construct NavBench-GS, the first end-to-end evaluation benchmark built on 3D Gaussian Splatting (3DGS) reconstructions of real scenes. Experiments demonstrate that our approach significantly mitigates the diminishing returns of offline data scaling and outperforms supervised fine-tuning in both generalization and safety, validating the critical role of online interactive experience in advancing navigation foundation models.
📝 Abstract
Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, trained solely on offline data, often lack the capacity to reason about the consequences of their actions or to adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and yielding to moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pre-training on videos and post-training through RL: it maintains the generalizability acquired from large-scale real-world videos while enhancing interactivity through RL in simulation environments. Specifically, we introduce two innovations: an Anchor-Guided Distribution Matching strategy, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and a Residual-Attention Module, which acquires reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3D Gaussian Splatting (3DGS) reconstructions of real-world scenes that incorporate physical interactions; it systematically assesses the generalizability and safety of navigation foundation models. Extensive experiments show that S2E mitigates the diminishing returns often seen when scaling with offline data alone. We also provide a thorough analysis of the benefits of reinforcement learning over supervised fine-tuning in the context of post-training for robot learning. Our findings highlight the crucial role of integrating interactive online experience to effectively scale foundation models in robotics.
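The abstract does not give the exact form of the Residual-Attention Module, but one common way to add a trainable branch "without erasing pretrained knowledge" is to zero-initialize the residual branch's output projection, so that before any RL updates the combined module reproduces the frozen pretrained output exactly. A minimal NumPy sketch of that idea (all weight names and shapes are hypothetical, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over a token sequence.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

d = 8
x = rng.normal(size=(4, d))                        # 4 tokens, d-dim features

# Frozen pretrained attention weights (stand-ins for the video-pretrained model).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Residual branch trained during RL post-training; its output projection is
# zero-initialized, so initially the branch contributes nothing.
Rq, Rk, Rv = (rng.normal(size=(d, d)) for _ in range(3))
Wo_res = np.zeros((d, d))

pretrained_out = attention(x, Wq, Wk, Wv)
residual_out = attention(x, Rq, Rk, Rv) @ Wo_res
combined = pretrained_out + residual_out

# Before any RL updates, the combined output equals the pretrained one,
# so post-training starts from the pretrained policy rather than from scratch.
assert np.allclose(combined, pretrained_out)
```

During RL post-training, gradients would update only the residual weights (`Rq`, `Rk`, `Rv`, `Wo_res`), letting reactive behavior grow on top of the frozen pretrained pathway.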
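Similarly, "anchor-based supervision" with distribution matching can be read as supervising a distribution over a fixed set of anchor trajectories rather than regressing a single continuous target, which keeps multiple plausible motion modes alive. A hedged NumPy sketch under that interpretation (anchor count, trajectory shape, and the soft-target construction are all illustrative assumptions, not the paper's specification):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)

K, T = 5, 10                             # number of anchors, trajectory length
anchors = rng.normal(size=(K, T, 2))     # e.g. cluster centers of expert (x, y) motions

# One expert demonstration, close to anchor 2 in this toy setup.
expert = anchors[2] + 0.05 * rng.normal(size=(T, 2))

# Soft target: anchors closer to the expert trajectory receive more mass,
# so supervision is spread over modes instead of collapsing to one target.
dists = np.linalg.norm((anchors - expert).reshape(K, -1), axis=1)
target = softmax(-dists)

logits = rng.normal(size=K)              # stand-in for the policy head's anchor logits
pred = softmax(logits)

# Distribution-matching loss: KL(target || pred), non-negative by construction.
kl = np.sum(target * (np.log(target + 1e-9) - np.log(pred + 1e-9)))
assert kl >= 0.0
```

Minimizing this KL pulls the predicted anchor distribution toward the expert-derived one, which is one way such a scheme could stabilize learning while still representing diverse motion patterns.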