Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision

📅 2025-12-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Monocular foundation models for visual navigation (NFMs) suffer from depth-scale ambiguity and weak geometric reasoning in dynamic, unstructured environments. Method: We propose the first end-to-end NFM framework integrating stereo vision input with plug-and-play mid-level vision modules, namely monocular depth estimation and dense optical flow tracking. Stereo input eliminates scale ambiguity, while a novel training paradigm achieves state-of-the-art performance using only 1.5% of the labeled data. We further introduce the first large-scale, automatically annotated stereo navigation video dataset. Contributions/Results: Our method outperforms prior approaches under full-data supervision; matches their performance using merely 1.5% of the labeled data; and significantly improves navigation success rate and robustness in dynamic scenes. The stereo input enhances geometric consistency and temporal coherence, enabling reliable path planning without explicit metric reconstruction. All components are modular, facilitating seamless integration into existing NFM pipelines. Experimental validation spans diverse indoor and outdoor navigation benchmarks, confirming consistent gains across both static and dynamic settings.
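The "plug-and-play" modularity described above can be illustrated with a minimal sketch: stereo frames pass through interchangeable mid-level modules whose outputs are fused and handed to an action head. All class, function, and module names here are hypothetical stand-ins; the paper's actual architecture is not specified in this summary.

```python
# Hedged sketch of a plug-and-play mid-level-vision pipeline.
# Any module mapping (left_frame, right_frame) -> features can be
# registered without touching the rest of the pipeline.
from typing import Callable, Dict


class NavigationPipeline:
    """Compose mid-level vision modules ahead of an action head."""

    def __init__(self, action_head: Callable[[Dict[str, object]], str]):
        self.modules: Dict[str, Callable] = {}
        self.action_head = action_head

    def register(self, name: str, module: Callable) -> None:
        # Plug-and-play: swap depth/tracking modules freely.
        self.modules[name] = module

    def step(self, left_frame, right_frame) -> str:
        # Run every registered module, then let the policy decide.
        feats = {name: m(left_frame, right_frame)
                 for name, m in self.modules.items()}
        return self.action_head(feats)


# Toy stand-ins for a depth module and a trivial policy:
pipe = NavigationPipeline(
    lambda feats: "forward" if feats["depth"] > 2.0 else "stop")
pipe.register("depth", lambda l, r: 3.5)  # pretend metric depth ahead (m)
print(pipe.step(None, None))  # -> forward
```

The design choice mirrored here is that the policy only sees a dictionary of named features, so a better depth or tracking module can be dropped in without retraining glue code.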

๐Ÿ“ Abstract
The success of foundation models in language and vision has motivated research into fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, and the depth-scale ambiguity of monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. In our experiments, mid-level vision enables StereoWalker to match state-of-the-art performance using only 1.5% of the training data and to surpass the state of the art with the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
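The abstract's core intuition, that stereo resolves the monocular depth-scale ambiguity, comes down to standard stereo geometry: a monocular depth network recovers depth only up to an unknown scale, whereas a calibrated stereo pair yields metric depth directly via Z = f·B/d. A minimal worked example, with illustrative numbers not taken from the paper:

```python
# Hedged sketch: metric depth from stereo disparity (standard
# pinhole stereo geometry, not the paper's specific implementation).
# Z = f * B / d, with focal length f in pixels, baseline B in meters,
# and disparity d in pixels.

def depth_from_disparity(disparity_px: float, focal_px: float,
                         baseline_m: float) -> float:
    """Return metric depth in meters for one matched pixel pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: 700 px focal length, 12 cm baseline,
# 10 px disparity gives a point 8.4 m away.
z = depth_from_disparity(10.0, 700.0, 0.12)
print(z)  # -> 8.4
```

Note the scale comes from the physically known baseline B, which is exactly what a monocular network lacks; this is why stereo input removes the ambiguity rather than merely reducing it.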
Problem

Research questions and friction points this paper is trying to address.

Addresses depth-scale ambiguity in monocular vision for navigation
Enhances geometric understanding in dynamic, unstructured environments
Reduces reliance on large-scale pixel-to-action supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stereo vision resolves depth-scale ambiguity in navigation
Mid-level vision modules provide geometric and motion structure
Combined approach reduces training data needs significantly