DreamToNav: Generalizable Navigation for Robots via Generative Video Planning

📅 2026-03-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes a unified navigation framework that addresses the limitations of traditional robot navigation, which typically relies on precise waypoints and cannot interpret natural language instructions or generalize across diverse tasks and platforms. The approach first translates a natural language command into a precise visual description, which then conditions a generative video model to "imagine" the task being executed by synthesizing a video sequence. Executable motion plans are extracted from this imagined trajectory. Notably, this is the first method to employ a generative video model as a planning engine, enabling task-agnostic behavior generation through "dream-like" simulation. Integrating the Qwen 2.5-VL-7B-Instruct multimodal large language model, the NVIDIA Cosmos 2.5 video generation model, and visual pose estimation, the system achieves a 76.7% task success rate on both wheeled and quadruped robots, with final goal errors of 0.05-0.10 m and trajectory tracking errors below 0.15 m.
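The three-stage pipeline the summary describes (language refinement, video generation, trajectory extraction) can be sketched compactly. The skeleton below is illustrative only, not the authors' implementation; every function name here is a hypothetical placeholder for the corresponding stage.

```python
# Illustrative outline of the DreamToNav pipeline. All names are
# hypothetical placeholders, not the authors' code or any real API.

def refine_instruction(prompt: str, frame) -> str:
    """Stage 1: a multimodal LLM (Qwen 2.5-VL-7B-Instruct in the paper)
    rewrites a vague user command as a precise visual description."""
    raise NotImplementedError

def generate_video(description: str, frame) -> list:
    """Stage 2: a video foundation model (NVIDIA Cosmos 2.5 in the paper)
    synthesizes frames of the robot performing the described task."""
    raise NotImplementedError

def extract_plan(frames: list) -> list:
    """Stage 3: robot detection and visual pose estimation recover an
    executable kinematic path from the generated frames."""
    raise NotImplementedError

def dream_to_nav(prompt: str, frame) -> list:
    """Chain the stages: language -> imagined video -> motion plan."""
    description = refine_instruction(prompt, frame)
    frames = generate_video(description, frame)
    return extract_plan(frames)  # handed to the platform controller
```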

📝 Abstract
We present DreamToNav, a novel autonomous robot framework that uses generative video models to enable intuitive, human-in-the-loop control. Instead of relying on rigid waypoint navigation, users provide natural language prompts (e.g., "Follow the person carefully"), which the system translates into executable motion. Our pipeline first employs Qwen 2.5-VL-7B-Instruct to refine vague user instructions into precise visual descriptions. These descriptions condition NVIDIA Cosmos 2.5, a state-of-the-art video foundation model, to synthesize a physically consistent video sequence of the robot performing the task. From this synthetic video, we extract a valid kinematic path using visual pose estimation, robot detection, and trajectory recovery. By treating video generation as a planning engine, DreamToNav allows robots to visually "dream" complex behaviors before executing them, providing a unified framework for obstacle avoidance and goal-directed navigation without task-specific engineering. We evaluate the approach on both a wheeled mobile robot and a quadruped robot in indoor navigation tasks. DreamToNav achieves a success rate of 76.7%, with final goal errors typically within 0.05-0.10 m and trajectory tracking errors below 0.15 m. These results demonstrate that trajectories extracted from generative video predictions can be reliably executed on physical robots across different locomotion platforms.
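To make the trajectory-recovery and evaluation steps concrete, here is a minimal sketch assuming per-frame robot positions have already been detected in the generated video and mapped to world coordinates. `recover_waypoints` and `tracking_error` are hypothetical helpers, not the paper's code; the error metric is one plausible way to report tracking errors like the sub-0.15 m figures quoted above.

```python
import numpy as np

def recover_waypoints(poses, step=0.05):
    """Hypothetical trajectory recovery: resample detected (x, y) robot
    positions into waypoints evenly spaced by arc length (meters)."""
    pts = np.asarray(poses, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])           # cumulative arc length
    s_new = np.arange(0.0, s[-1], step)
    x = np.interp(s_new, s, pts[:, 0])
    y = np.interp(s_new, s, pts[:, 1])
    return np.stack([x, y], axis=1)

def tracking_error(executed, reference):
    """Mean distance from each executed pose to its nearest reference
    waypoint; one way to summarize trajectory tracking error."""
    d = np.linalg.norm(executed[:, None, :] - reference[None, :, :], axis=2)
    return d.min(axis=1).mean()

if __name__ == "__main__":
    raw = [(0.0, 0.0), (0.3, 0.1), (0.8, 0.4), (1.2, 0.9)]
    ref = recover_waypoints(raw, step=0.05)
    # Error of a uniformly shifted copy of the path, as a sanity check.
    print(tracking_error(ref + 0.05, ref))
```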
Problem

Research questions and friction points this paper is trying to address.

robot navigation
natural language instruction
generalizable navigation
autonomous robots
human-in-the-loop control
Innovation

Methods, ideas, or system contributions that make the work stand out.

generative video planning
human-in-the-loop navigation
vision-based trajectory extraction
foundation video model
language-to-motion translation
Valerii Serpiva
PhD student, Skolkovo Institute of Science and Technology
Robotics, UAVs, Autonomous Drones, Human-Robot Interaction
Jeffrin Sam
Skolkovo Institute of Science and Technology
Robotics, AI, Humanoids, Simulation
Chidera Simon
Intelligent Space Robotics Laboratory, Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld. 1, 121205, Moscow, Russia
Hajira Amjad
Intelligent Space Robotics Laboratory, Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld. 1, 121205, Moscow, Russia
Iana Zhura
Intelligent Space Robotics Laboratory, Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld. 1, 121205, Moscow, Russia
Artem Lykov
PhD student, Skolkovo Institute of Science and Technology
Robotics, AI, Cognitive Robotics, VLA
Dzmitry Tsetserukou
Associate Professor, Skolkovo Institute of Science and Technology (Skoltech)
Robotics, Haptics, UAV Swarm, AI, VR