VideoAgent: Self-Improving Video Generation

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing video generation methods produce hallucinated content and physically implausible dynamics when applied to robotic visual planning, leading to ineffective control actions and low task success rates. To address this, the paper proposes a self-improving video generation framework grounded in closed-loop environmental feedback: a self-conditioning consistency mechanism calibrates the generative process at inference time; video diffusion models are combined with language–vision–action multimodal alignment, interactive environment feedback, and self-correction; and the framework supports online data collection and iterative model refinement on real robots. The authors present this as the first end-to-end, closed-loop, feedback-driven optimization of video generation for robotics. Experiments on MetaWorld and iTHOR show significant reductions in hallucination and substantial gains in downstream manipulation success, and the authors further demonstrate online fine-tuning of video generation on a physical robot platform.

📝 Abstract
Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines the generated video plans using a novel procedure which we call self-conditioning consistency, allowing inference-time compute to be turned into better generated video plans. As the refined video plan is being executed, VideoAgent can collect additional data from the environment to further improve video plan generation. Experiments in simulated robotic manipulation from MetaWorld and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting success rate of downstream manipulation tasks. We further illustrate that VideoAgent can effectively refine real-robot videos, providing an early indicator that robots can be an effective tool in grounding video generation in the physical world. Video demos and code can be found at https://video-as-agent.github.io.
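The closed loop the abstract describes (generate a video plan, refine it with self-conditioning consistency at inference time, execute, and feed environment outcomes back into the model) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all names (`generate_plan`, `refine`, `execute_and_collect`) and the scalar "noise" stand-in for plan quality are hypothetical.

```python
# Hypothetical sketch of VideoAgent's closed loop. A video plan is abstracted
# as a dict carrying a residual "noise" level that refinement drives down.

def generate_plan(instruction, observation, noise=1.0):
    """Stand-in for a video diffusion model conditioned on an image
    observation and a language instruction."""
    return {"instruction": instruction, "frames": [observation] * 8, "noise": noise}

def refine(plan, steps=3, decay=0.5):
    """Self-conditioning consistency (sketch): each pass re-conditions the
    generator on its own previous output, so extra inference-time compute
    shrinks residual error before any action is executed."""
    for _ in range(steps):
        plan = {**plan, "noise": plan["noise"] * decay}
    return plan

def execute_and_collect(plan, success_threshold=0.2):
    """Execute the refined plan in the environment; the outcome doubles as
    feedback, and the executed trajectory becomes new training data."""
    success = plan["noise"] < success_threshold
    return success, plan

plan = generate_plan("pick up the mug", observation=0)
refined = refine(plan)                      # inference-time refinement
success, new_data = execute_and_collect(refined)  # environment feedback
```

The key design point mirrored here is that refinement happens *before* execution, while data collection happens *during* execution, so the model improves both at inference time and across training iterations.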
Problem

Research questions and friction points this paper is trying to address.

Improve video generation quality
Reduce hallucinatory content
Enhance robotic task success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-conditioning consistency technique
External feedback integration
Real-robot video refinement