🤖 AI Summary
This work proposes a unified navigation framework that overcomes the limitations of traditional robot navigation, which typically relies on precise waypoints and cannot interpret natural language instructions or generalize across diverse tasks and platforms. The approach uses a multimodal language model to translate natural language commands into precise visual descriptions, which then condition a generative video model to "imagine" execution of the task as a generated video sequence. Executable motion plans are extracted from this imagined trajectory. Notably, this is presented as the first method to employ a generative video model as a planning engine, enabling task-agnostic behavior generation through "dream-like" simulation. Integrating the Qwen 2.5-VL-7B-Instruct multimodal large language model, the NVIDIA Cosmos 2.5 video generation model, and visual pose estimation, the system achieves a 76.7% task success rate on both wheeled and quadruped robots, with goal errors of 0.05–0.10 meters and trajectory tracking errors below 0.15 meters.
📝 Abstract
We present DreamToNav, a novel autonomous robot framework that uses generative video models to enable intuitive, human-in-the-loop control. Instead of relying on rigid waypoint navigation, users provide natural language prompts (e.g., "Follow the person carefully"), which the system translates into executable motion. Our pipeline first employs Qwen 2.5-VL-7B-Instruct to refine vague user instructions into precise visual descriptions. These descriptions condition NVIDIA Cosmos 2.5, a state-of-the-art video foundation model, to synthesize a physically consistent video sequence of the robot performing the task. From this synthetic video, we extract a valid kinematic path using visual pose estimation, robot detection, and trajectory recovery. By treating video generation as a planning engine, DreamToNav allows robots to visually "dream" complex behaviors before executing them, providing a unified framework for obstacle avoidance and goal-directed navigation without task-specific engineering. We evaluate the approach on both a wheeled mobile robot and a quadruped robot in indoor navigation tasks. DreamToNav achieves a success rate of 76.7%, with final goal errors typically within 0.05–0.10 m and trajectory tracking errors below 0.15 m. These results demonstrate that trajectories extracted from generative video predictions can be reliably executed on physical robots across different locomotion platforms.
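The three pipeline stages described above (instruction refinement, video "dreaming", trajectory extraction) can be sketched as a data flow. This is a minimal hypothetical sketch, not the authors' implementation: the real system calls Qwen 2.5-VL-7B-Instruct and NVIDIA Cosmos 2.5, while here each stage is a placeholder stub (`refine_instruction`, `generate_video`, `extract_trajectory` are invented names) so only the shape of the pipeline is shown.

```python
# Hypothetical sketch of the DreamToNav pipeline stages. All function
# names and the fake frame data are illustrative placeholders; the real
# system uses an MLLM, a video foundation model, and pose estimation.
from dataclasses import dataclass


@dataclass
class Waypoint:
    x: float  # meters, robot frame
    y: float


def refine_instruction(prompt: str) -> str:
    """Stage 1 (stub): a multimodal LLM refines a vague user prompt
    into a precise visual description for the video model."""
    return f"Robot-centric view: {prompt}, smooth collision-free motion."


def generate_video(description: str, n_frames: int = 8) -> list[dict]:
    """Stage 2 (stub): a video foundation model 'dreams' the task.
    Here we fabricate frames whose metadata carries a robot position."""
    return [{"frame": i, "robot_xy": (0.1 * i, 0.05 * i)} for i in range(n_frames)]


def extract_trajectory(frames: list[dict]) -> list[Waypoint]:
    """Stage 3 (stub): pose estimation and robot detection would recover
    the path from pixels; here we just read the stored positions."""
    return [Waypoint(*f["robot_xy"]) for f in frames]


def plan(prompt: str) -> list[Waypoint]:
    """Chain the stages: language -> imagined video -> kinematic path."""
    return extract_trajectory(generate_video(refine_instruction(prompt)))


waypoints = plan("Follow the person carefully")
print(len(waypoints), waypoints[-1])
```

The extracted waypoint list would then be handed to a platform-specific controller, which is what lets the same imagined trajectory drive both wheeled and quadruped robots.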