Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of current robotic systems stemming from their inability to accurately predict environmental dynamics. While existing vision-language and world models either lack explicit future-state prediction or suffer from short prediction horizons and spatial inconsistency, this paper proposes a video-based world model framework tailored for instruction-driven robotic manipulation. By fine-tuning a high-quality video generation model and incorporating adversarial distillation for efficient multi-step forecasting, the approach enables long-horizon, spatially consistent, and instruction-aligned video prediction. Furthermore, it combines generated videos with real-world observations to train action policies that correct spatial errors. The method achieves state-of-the-art performance, significantly improving embodiment consistency, spatial reference accuracy, and task success rates over existing baselines.
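The distillation step mentioned above — compressing a many-step video predictor into a few-step one — can be caricatured with a toy sketch. Everything here is hypothetical: the "teacher" and "student" are scalar-weighted linear maps rather than video diffusion models, and the paper's adversarial critic is replaced by a plain regression loss for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_rollout(noisy, target, steps=50):
    """Toy 'teacher': denoise toward the target frame over many small steps."""
    x = noisy.copy()
    for _ in range(steps):
        x += (target - x) / steps
    return x

def student_step(noisy, target, w):
    """Toy 'student': a single learned step with scalar weight w."""
    return noisy + w * (target - noisy)

# Distill: fit w so one student step matches the 50-step teacher rollout.
target = rng.normal(size=8)
w, lr = 0.0, 0.5
for _ in range(200):
    noisy = target + rng.normal(size=8)
    diff = student_step(noisy, target, w) - teacher_rollout(noisy, target)
    grad = 2 * np.mean(diff * (target - noisy))  # d/dw of the squared error
    w -= lr * grad

# w converges to 1 - (1 - 1/50)**50, about 0.636: one step reproducing fifty.
```

The same idea — matching a slow multi-step sampler with a fast few-step one, with an adversarial loss keeping outputs on the data manifold — is what lets the world model forecast long horizons quickly enough for control.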

📝 Abstract
Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision-Language Models (VLMs) provide high-level guidance, they cannot explicitly forecast future states, and existing world models either predict only short horizons or produce spatially inconsistent frames. To address these challenges, we propose a framework for fast and predictive video-conditioned action. Our approach first selects and adapts a robust video generation model to ensure reliable future predictions, then applies adversarial distillation for fast, few-step video generation, and finally trains an action model that leverages both generated videos and real observations to correct spatial errors. Extensive experiments show that our method produces temporally coherent, spatially accurate video predictions that directly support precise manipulation, achieving significant improvements in embodiment consistency, spatial referring ability, and task completion over existing baselines. Code and models will be released.
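The last stage of the pipeline — an action model that grounds generated future frames in real observations to correct spatial errors — can be sketched as a simple fusion step. This is a hypothetical illustration only: the function names, the global-offset correction, and the blend weight are stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

def fuse(pred_frame, real_obs, alpha=0.8):
    """Re-anchor a generated frame to the real observation, then blend.

    A crude stand-in for spatial-error correction: subtract the global
    offset between prediction and observation before weighting.
    """
    drift = pred_frame.mean() - real_obs.mean()
    corrected = pred_frame - drift
    return alpha * corrected + (1 - alpha) * real_obs

real_obs = rng.normal(size=(4, 4))
# A generated frame with a constant spatial drift plus small noise.
pred_frame = real_obs + 0.5 + rng.normal(scale=0.05, size=(4, 4))

fused = fuse(pred_frame, real_obs)
# After fusion, the frame's mean is re-aligned with the real observation.
```

In the actual system the correction would be learned end-to-end by the action policy, but the design choice is the same: the generated video supplies the goal, while the real observation keeps the policy spatially grounded.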
Problem

Research questions and friction points this paper is trying to address.

robotic manipulation
world models
video prediction
spatial consistency
future state anticipation
Innovation

Methods, ideas, or system contributions that make the work stand out.

video world models
instruction-driven manipulation
adversarial distillation
spatial consistency
predictive video generation