RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address error accumulation in autoregressive modeling for long-horizon robotic manipulation video generation, this paper proposes a non-autoregressive framework. First, high-level tasks are decomposed into atomic subtasks, and semantically consistent keyframes are generated. Then, a diffusion model interpolates between these keyframes to synthesize long video sequences. A semantic-preserving attention mechanism is introduced to ensure cross-frame semantic coherence, and a lightweight video-to-joint-state policy regression model is designed for end-to-end controllable execution. Evaluated on two benchmarks, the method significantly improves video fidelity, task consistency, and executability. It achieves state-of-the-art performance in both video generation quality and policy transfer capability.
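As a rough, hypothetical illustration of the non-autoregressive control flow described above (subtask decomposition, keyframe generation, pairwise interpolation), here is a minimal Python sketch. `decompose_task`, `KeyframeGenerator`, and `FrameInterpolator` are placeholder stand-ins invented for this example; they are not the paper's models, and the naive linear blend only marks where the interpolation diffusion model would run.

```python
import numpy as np

def decompose_task(goal: str) -> list[str]:
    """Placeholder for the decomposition of a high-level goal into atomic subtasks."""
    return [f"{goal} - step {i}" for i in range(3)]

class KeyframeGenerator:
    """Stand-in for the text-conditioned keyframe diffusion model."""
    def __call__(self, instruction: str, height: int = 64, width: int = 64) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(instruction)) % (2**32))
        return rng.random((height, width, 3), dtype=np.float32)

class FrameInterpolator:
    """Stand-in for the second diffusion model that fills frames between two keyframes."""
    def __call__(self, start: np.ndarray, end: np.ndarray, n_frames: int = 8) -> np.ndarray:
        alphas = np.linspace(0.0, 1.0, n_frames, dtype=np.float32)[:, None, None, None]
        return (1.0 - alphas) * start[None] + alphas * end[None]  # naive linear blend

def generate_long_horizon_video(goal: str) -> np.ndarray:
    """Keyframes per subtask, then pairwise interpolation; no autoregressive rollout."""
    keyframe_model, interpolator = KeyframeGenerator(), FrameInterpolator()
    keyframes = [keyframe_model(subtask) for subtask in decompose_task(goal)]
    clips = [interpolator(a, b) for a, b in zip(keyframes[:-1], keyframes[1:])]
    return np.concatenate(clips, axis=0)  # (T, H, W, 3)

video = generate_long_horizon_video("put the mug in the drawer")
print(video.shape)
```

Because each clip is conditioned only on its two bounding keyframes, errors in one clip do not feed into the next, which is the motivation for avoiding the autoregressive rollout.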

📝 Abstract
We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, leading to error accumulation in the generated video and in the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) we first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions. A second diffusion model then interpolates between each pair of consecutive keyframes, yielding the long-horizon video. 2) We propose a semantics-preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from the generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.
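The abstract names a semantics-preserving attention module but does not spell it out here. Purely as an illustration of one plausible mechanism, the sketch below applies a generic scaled dot-product cross-attention in which every keyframe's tokens attend to a shared set of reference tokens (for example, tokens from the first keyframe), nudging all keyframes toward a common semantic description. The shapes, the residual update, and the choice of reference tokens are assumptions, not the paper's design.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(frame_tokens: np.ndarray, ref_tokens: np.ndarray) -> np.ndarray:
    """frame_tokens: (K, N, D) tokens of K keyframes; ref_tokens: (M, D) shared reference tokens.

    Every keyframe queries the same reference tokens, so all keyframes are pulled
    toward one shared semantic description (illustrative only, not the paper's module).
    """
    d = frame_tokens.shape[-1]
    scores = frame_tokens @ ref_tokens.T / np.sqrt(d)   # (K, N, M)
    weights = softmax(scores, axis=-1)                  # attention over reference tokens
    attended = weights @ ref_tokens                     # (K, N, D)
    return frame_tokens + attended                      # residual update

K, N, M, D = 4, 16, 16, 32
rng = np.random.default_rng(0)
out = cross_frame_attention(rng.standard_normal((K, N, D)), rng.standard_normal((M, D)))
print(out.shape)  # (4, 16, 32)
```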
Problem

Research questions and friction points this paper is trying to address.

Generating long-horizon videos for robotic manipulation tasks
Overcoming error accumulation in autoregressive video generation
Maintaining video consistency and quality in multi-task execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes high-level goals into atomic subtasks for keyframe generation
Uses semantics-preserving attention for cross-keyframe consistency
Lightweight policy model regresses joint states from generated videos (see the sketch after this list)
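To make the third point concrete, the sketch below shows what a lightweight video-to-joint-state regressor could look like: a per-frame encoder followed by a linear head that predicts one joint-state vector per generated frame. The architecture, input resolution, and seven-joint output are assumptions made for illustration, not the policy model proposed in the paper.

```python
import torch
import torch.nn as nn

class VideoToJointPolicy(nn.Module):
    """Hypothetical lightweight regressor: generated video frames -> robot joint states."""

    def __init__(self, frame_dim: int = 256, num_joints: int = 7):
        super().__init__()
        # Per-frame encoder: flattens a 64x64 RGB frame into a compact feature vector.
        self.frame_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 64 * 3, frame_dim), nn.ReLU()
        )
        # Per-frame head: one joint-state vector per generated frame.
        self.head = nn.Linear(frame_dim, num_joints)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, 64, 64) -> joints: (B, T, num_joints)
        b, t = video.shape[:2]
        feats = self.frame_encoder(video.reshape(b * t, *video.shape[2:]))
        return self.head(feats).reshape(b, t, -1)

policy = VideoToJointPolicy()
fake_video = torch.rand(2, 16, 3, 64, 64)  # batch of two 16-frame generated clips
print(policy(fake_video).shape)            # torch.Size([2, 16, 7])
```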
Liudi Yang
Department of Computer Science, University of Freiburg, Germany
Yang Bai
Ludwig Maximilian University of Munich, Germany
George Eskandar
University of Stuttgart
Computer Vision, Domain Adaptation, Generative AI, Autonomous Driving, 3D Reconstruction
Fengyi Shen
Technical University of Munich, Germany
Mohammad Altillawi
Huawei Munich Research Center, Germany
Dong Chen
Huawei Munich Research Center, Germany
Soumajit Majumder
Huawei Munich Research Center, Germany
Ziyuan Liu
Unknown affiliation
Robotics, Manipulation and Grasping, Computer Vision, Machine Learning
Gitta Kutyniok
Bavarian AI Chair for Mathematical Foundations of Artificial Intelligence, LMU Munich
Applied Harmonic Analysis, Artificial Intelligence, Data Science, Imaging Science, Inverse Problems
Abhinav Valada
Professor & Director of Robot Learning Lab, University of Freiburg
Robotics, Machine Learning, Computer Vision, Artificial Intelligence