Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

📅 2026-02-05
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the challenge of Beyond-the-View Navigation (BVN)—navigating over long distances in previously unseen environments using only high-level language instructions. To this end, it introduces video generation models into BVN for the first time, proposing a sparse future-frame generation strategy that maintains a 20-second planning horizon while reducing inference latency to sub-second levels. By leveraging language-vision alignment and long-horizon supervised learning, the method enables zero-shot deployment in real-world environments and demonstrates robust navigation even under challenging nighttime conditions. Experimental results show that the approach achieves a success rate 2.5 times higher than existing large language model (LLM) baselines and accelerates inference by a factor of 27.
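The core latency argument can be made concrete with a back-of-the-envelope sketch: instead of generating a dense video over the 20-second horizon, only a handful of evenly spaced keyframes are generated. The frame rate and keyframe count below are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative sketch of sparse future-frame planning.
# FPS and N_SPARSE are assumed values, not taken from the paper.

def keyframe_times(horizon_s: float, n_keyframes: int) -> list[float]:
    """Evenly spaced timestamps for sparse keyframes over the horizon."""
    step = horizon_s / n_keyframes
    return [round(step * (i + 1), 2) for i in range(n_keyframes)]

def dense_frame_count(horizon_s: float, fps: int) -> int:
    """Number of frames a dense video generator would have to produce."""
    return int(horizon_s * fps)

if __name__ == "__main__":
    HORIZON_S = 20.0   # planning horizon reported in the paper
    FPS = 10           # assumed dense-generation frame rate
    N_SPARSE = 8       # assumed number of sparse keyframes

    dense = dense_frame_count(HORIZON_S, FPS)
    sparse = len(keyframe_times(HORIZON_S, N_SPARSE))
    print(f"dense={dense}, sparse={sparse}, reduction={dense / sparse:.0f}x")
    # → dense=200, sparse=8, reduction=25x
```

Under these assumed numbers, generating 8 keyframes instead of 200 dense frames cuts the generation workload by roughly 25x, which is the kind of saving that makes a 27x end-to-end speed-up plausible.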

📝 Abstract
Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple, high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense, step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet, the prohibitive latency of generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up over the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
Beyond-the-View Navigation
Sparse Video Generation
Long-horizon Planning
Real-world Autonomy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Video Generation
Beyond-the-View Navigation
Long-horizon Supervision
Trajectory Inference
Zero-shot Real-world Navigation