VLD: Visual Language Goal Distance for Reinforcement Learning Navigation

📅 2025-12-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
End-to-end visual navigation faces two major bottlenecks: a significant simulation-to-reality gap and a scarcity of action-labeled data. To address these, we propose the Vision-Language Distance (VLD) learning framework, which decouples perception learning from policy learning. First, leveraging internet-scale video data, we train a generalizable distance predictor via self-supervised learning; it estimates distances to both image- and text-specified goals, with an ordinal consistency constraint that improves geometric plausibility. Second, this learned distance signal serves as a dense reward for reinforcement learning to optimize navigation policies. Our method integrates vision-language models, geometric distance distillation, and noise-robust training. Experiments demonstrate that VLD achieves navigation performance in simulation competitive with state-of-the-art methods, while significantly improving cross-modal (image/text) goal generalization and system scalability.
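The core training idea, using a distance-to-goal signal as a dense reward and injecting noise in simulation to mimic predictor uncertainty, can be sketched as follows. This is an illustrative assumption, not the paper's exact reward formulation; the function name, the distance-reduction shaping, and the Gaussian noise model are all hypothetical:

```python
import random

def shaped_reward(d_prev: float, d_curr: float, noise_std: float = 0.1) -> float:
    """Dense reward from a distance-to-goal predictor (illustrative sketch).

    The reward is positive when the agent reduces its predicted distance
    to the goal. Gaussian noise is added to each distance reading to mimic
    the uncertainty of the learned predictor, so the RL policy trained in
    simulation stays robust to noisy distance estimates at deployment.
    """
    noisy_prev = d_prev + random.gauss(0.0, noise_std)
    noisy_curr = d_curr + random.gauss(0.0, noise_std)
    return noisy_prev - noisy_curr
```

With `noise_std=0.0` this reduces to plain distance-reduction shaping; during simulation training a nonzero `noise_std` stands in for the trained predictor's error.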

📝 Abstract
Training end-to-end policies from image data to directly predict navigation actions for robotic systems has proven inherently difficult. Existing approaches often suffer from either the sim-to-real gap during policy transfer or a limited amount of training data with action labels. To address this problem, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning. Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected noise to mimic the uncertainty of the trained distance predictor. At deployment, the policy consumes VLD predictions, inheriting semantic goal information ("where to go") from large-scale visual training while retaining the robust low-level navigation behaviors learned in simulation. We propose using ordinal consistency to assess distance functions directly and demonstrate that VLD outperforms prior temporal distance approaches, such as ViNT and VIP. Experiments show that our decoupled design achieves competitive navigation performance in simulation while supporting flexible goal modalities, providing an alternative and, most importantly, scalable path toward reliable, multimodal navigation policies.
Problem

Research questions and friction points this paper is trying to address.

Addresses sim-to-real gap in robotic navigation training
Reduces reliance on limited action-labeled training data
Enables multimodal goal specification for navigation policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised distance predictor on internet-scale video data
Decouples perception learning from policy learning
Uses ordinal consistency to assess distance functions
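The ordinal-consistency idea, that on a trajectory ending at the goal, later frames should be predicted closer, can be sketched as a pairwise ranking metric. The metric definition below is a hypothetical reading of the abstract, not the paper's published formula:

```python
def ordinal_consistency(pred_dists: list[float]) -> float:
    """Fraction of frame pairs whose predicted goal distances preserve
    temporal order (illustrative sketch of an ordinal-consistency metric).

    pred_dists[i] is the predicted distance from trajectory frame i to the
    goal at the final frame. For an ideal predictor, distance strictly
    decreases along the trajectory, so every pair (i, j) with i < j should
    satisfy pred_dists[i] > pred_dists[j].
    """
    n = len(pred_dists)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 1.0  # trivially consistent for 0 or 1 frames
    correct = sum(1 for i, j in pairs if pred_dists[i] > pred_dists[j])
    return correct / len(pairs)
```

A perfectly monotone predictor scores 1.0; a predictor whose distances grow toward the goal scores 0.0, which makes the measure easy to compare across distance functions regardless of their absolute scale.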
👥 Authors
Lazar Milikic (ETH Zurich, EPFL)
Manthan Patel (Robotic Systems Lab, ETH Zurich; Computer Vision, Navigation, SLAM, Field Robotics)
Jonas Frey (ETH Zurich, Stanford University, UC Berkeley)