YOPO-Nav: Visual Navigation using 3DGS Graphs from One-Pass Videos

📅 2025-12-10

🤖 AI Summary
To address the high computational and memory cost of building 3D maps for large-scale, real-world visual navigation, this paper proposes a lightweight navigation approach that does not rely on metric maps. It builds a graph-structured spatial representation of interconnected local 3D Gaussian Splatting (3DGS) models from a single pass of exploration video. It then combines coarse visual place recognition (VPR) with fine-grained, 3DGS-based pose refinement to predict actions for closed-loop robot control. Key contributions include: (i) a "one-pass video" modeling paradigm, in which a single traversal of the environment is enough to build the representation; (ii) a 3DGS-based graph representation in place of a conventional metric map; and (iii) a hierarchical localization, alignment, and navigation architecture. The method is evaluated on YOPO-Campus, a newly introduced real-world campus dataset comprising 4 hours of egocentric video and over 6 km of human-teleoperated robot trajectories, and on a physical Clearpath Jackal robot. There it is reported to outperform existing map-free visual navigation methods in image-goal navigation.
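The graph representation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `LocalSplatNode`, `SplatGraph`, and `build_graph_from_video` are hypothetical. The video is assumed to be split into fixed-length, overlapping frame segments, one local 3DGS model per segment; the paper's actual segmentation rule is not given here. The fitted 3DGS model itself is left as a placeholder. Consecutive segments are linked by edges, so retracing the demonstrated trajectory reduces to a path search over node ids.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LocalSplatNode:
    """One node of a hypothetical 3DGS graph: a local model fitted to a video segment."""
    node_id: int
    frame_range: tuple[int, int]  # [start, end) frame indices of the segment
    model: object = None  # placeholder; a real system would store fitted Gaussians here


@dataclass
class SplatGraph:
    nodes: dict[int, LocalSplatNode] = field(default_factory=dict)
    edges: dict[int, set[int]] = field(default_factory=dict)

    def add_edge(self, a: int, b: int) -> None:
        # Undirected edge: the trajectory may be retraced in either direction.
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def shortest_path(self, start: int, goal: int) -> list[int]:
        """Breadth-first search over node ids; returns [] if the goal is unreachable."""
        parent: dict[int, Optional[int]] = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                cur: Optional[int] = node
                while cur is not None:
                    path.append(cur)
                    cur = parent[cur]
                return path[::-1]
            for nxt in sorted(self.edges.get(node, ())):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return []


def build_graph_from_video(num_frames: int, segment_len: int, overlap: int) -> SplatGraph:
    """Split one exploration video into overlapping segments and chain them in order."""
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    graph = SplatGraph()
    stride = segment_len - overlap
    start, node_id = 0, 0
    while start < num_frames:
        end = min(start + segment_len, num_frames)
        graph.nodes[node_id] = LocalSplatNode(node_id, (start, end))
        if node_id > 0:
            graph.add_edge(node_id - 1, node_id)  # link to the previous segment
        if end == num_frames:
            break
        start += stride
        node_id += 1
    return graph
```

Breadth-first search suffices because the edges in this sketch are unweighted. If the same place were recognized on two parts of the video, adding an extra edge between those nodes would let the path search take the shortcut.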

📝 Abstract
Visual navigation has emerged as a practical alternative to traditional robotic navigation pipelines that rely on detailed mapping and path planning. However, constructing and maintaining 3D maps is often computationally expensive and memory-intensive. We address the problem of visual navigation when exploration videos of a large environment are available. The videos serve as a visual reference, allowing a robot to retrace the explored trajectories without relying on metric maps. Our proposed method, YOPO-Nav (You Only Pass Once), encodes an environment into a compact spatial representation composed of interconnected local 3D Gaussian Splatting (3DGS) models. During navigation, the framework aligns the robot's current visual observation with this representation and predicts actions that guide it back toward the demonstrated trajectory. YOPO-Nav employs a hierarchical design: a visual place recognition (VPR) module provides coarse localization, while the local 3DGS models refine the goal and intermediate poses to generate control actions. To evaluate our approach, we introduce the YOPO-Campus dataset, comprising 4 hours of egocentric video and robot controller inputs from over 6 km of human-teleoperated robot trajectories. We benchmark recent visual navigation methods on trajectories from YOPO-Campus using a Clearpath Jackal robot. Experimental results show YOPO-Nav provides excellent performance in image-goal navigation for real-world scenes on a physical robot. The dataset and code will be made publicly available for visual navigation and scene representation research.
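The coarse-to-fine localization in the abstract can be illustrated with a small retrieval sketch. It assumes, hypothetically, that each graph node stores global image descriptors for its keyframes (produced by some VPR network, not shown). Nodes are ranked by cosine similarity to the query descriptor. The fine stage is a hypothetical `refine_pose` callback standing in for 3DGS-based pose refinement; here it only has to return a pose and a residual error. All function names are illustrative.

```python
import math
from typing import Callable, Optional

Pose = tuple[float, float, float]  # hypothetical (x, y, yaw) pose


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two descriptors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def coarse_localize(
    query: list[float],
    node_descriptors: dict[int, list[list[float]]],
    top_k: int = 1,
) -> list[tuple[int, float]]:
    """Coarse stage: rank nodes by their best-matching keyframe descriptor."""
    scores = []
    for node_id, descs in node_descriptors.items():
        if descs:  # skip nodes without stored keyframes
            scores.append((node_id, max(cosine_similarity(query, d) for d in descs)))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


def hierarchical_localize(
    query: list[float],
    node_descriptors: dict[int, list[list[float]]],
    refine_pose: Callable[[int, list[float]], tuple[Pose, float]],
    top_k: int = 3,
) -> Optional[tuple[int, Pose, float]]:
    """Fine stage: refine each coarse candidate and keep the lowest-error pose."""
    best: Optional[tuple[int, Pose, float]] = None
    for node_id, _score in coarse_localize(query, node_descriptors, top_k):
        pose, error = refine_pose(node_id, query)
        if best is None or error < best[2]:
            best = (node_id, pose, error)
    return best  # None when no node has descriptors
```

Keeping several coarse candidates (`top_k > 1`) lets the fine stage recover when the best VPR match is a visually similar but wrong place.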
Problem

Research questions and friction points this paper is trying to address.

Visual navigation without expensive 3D maps using exploration videos.
Encoding environments into compact 3DGS graphs for trajectory retracing.
Aligning robot observations with 3DGS graphs to predict control actions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encodes the environment into interconnected local 3D Gaussian Splatting (3DGS) models
Uses a hierarchical design: visual place recognition for coarse localization, refined by the local 3DGS models
Aligns current visual observation with 3DGS representation to predict actions
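Turning a refined pose into a control action can be sketched with a simple proportional controller for a differential-drive base. This is a generic, swapped-in technique, not the paper's action-prediction module. The `Pose2D` type, the gains, the velocity limits, and the rotate-in-place heading gate are all hypothetical choices. The target is assumed to be the refined pose of the next node along the demonstrated trajectory.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose2D:
    x: float
    y: float
    yaw: float  # radians


def wrap_angle(angle: float) -> float:
    """Wrap an angle to [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def velocity_command(
    current: Pose2D,
    target: Pose2D,
    k_lin: float = 0.5,  # hypothetical linear gain
    k_ang: float = 1.5,  # hypothetical angular gain
    v_max: float = 1.0,  # hypothetical limit, m/s
    w_max: float = 1.0,  # hypothetical limit, rad/s
    heading_gate: float = math.pi / 4,
) -> tuple[float, float]:
    """Return (linear, angular) velocity that steers `current` toward `target`."""
    dx, dy = target.x - current.x, target.y - current.y
    distance = math.hypot(dx, dy)
    heading_error = wrap_angle(math.atan2(dy, dx) - current.yaw)
    # Proportional turn rate, clipped to the angular limit.
    w = max(-w_max, min(w_max, k_ang * heading_error))
    # Drive forward only when roughly facing the waypoint; otherwise rotate in place.
    v = min(v_max, k_lin * distance) if abs(heading_error) < heading_gate else 0.0
    return v, w
```

On a ROS-based platform such as the Jackal, the returned pair would typically be sent as the `linear.x` and `angular.z` fields of a `geometry_msgs/Twist` velocity command. Gating forward motion on heading error keeps the robot from sweeping wide arcs when the waypoint is beside or behind it.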