🤖 AI Summary
This work addresses the limitations of existing video generation models in synthesizing physically plausible, high-quality videos of robotic interactions, as well as the absence of a standardized evaluation benchmark. To this end, we propose an evaluation framework for embodied-intelligence-oriented video generation, introducing RBench, a standardized benchmark spanning five task categories and four embodied agent morphologies, and release RoVid-X, the largest open-source robot video dataset to date, comprising 4 million clips with detailed physical attribute annotations produced by a four-stage data processing pipeline. RBench scores models along reproducible, multidimensional metrics, including structural consistency, physical plausibility, and action completeness, and achieves a Spearman correlation coefficient of 0.96 with human evaluations. A comprehensive assessment of 25 state-of-the-art models reveals substantial deficiencies in physical realism, establishing RBench as a reliable benchmark and RoVid-X as high-quality training data for future research.
📝 Abstract
Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
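As a rough illustration of the validation protocol, the sketch below shows how agreement between benchmark scores and human ratings can be measured with a Spearman rank correlation. The per-model scores are invented for the example and are not the paper's data; only the use of Spearman correlation is taken from the abstract.

```python
# Minimal sketch (not the paper's evaluation code): computing benchmark-human
# agreement of the kind reported as Spearman rho = 0.96. Scores are illustrative.
from scipy.stats import spearmanr

# Hypothetical per-model values: an aggregate RBench score and a mean human
# rating for each evaluated video generation model.
rbench_scores = [0.81, 0.64, 0.72, 0.55, 0.90, 0.47, 0.68, 0.77]
human_ratings = [4.2, 3.1, 3.6, 2.8, 4.6, 2.4, 3.3, 4.0]

# Spearman correlation compares the rankings induced by the two score lists.
rho, p_value = spearmanr(rbench_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```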