🤖 AI Summary
This work addresses the structural inconsistency in text-to-video generation caused by the absence of 3D geometric constraints. Existing approaches incorporate 3D priors through architectural modifications, often at high computational cost and with limited scalability. This paper instead proposes World-R1, a reinforcement learning framework that requires no alteration to the base generative model. Using the Flow-GRPO algorithm, the method optimizes the model with feedback signals from pretrained 3D foundation models and vision-language models to improve geometric consistency during video synthesis. The authors also introduce a text-only world simulation dataset and a periodic decoupled training strategy that balances rigid geometric consistency with dynamic scene fluidity. According to the reported evaluations, the approach significantly improves 3D consistency while preserving the visual quality of the foundation model, bridging the gap between high-quality video generation and scalable world simulation.
📝 Abstract
Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized text-only dataset tailored for world simulation. Using Flow-GRPO, we optimize the model with feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
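The abstract does not spell out how the two feedback signals are combined or how they reach the policy update. As a rough illustration only, the sketch below shows the group-relative advantage step that GRPO-style methods generally rely on: several videos are sampled for one prompt, each receives a scalar reward, and the rewards are standardized within that group. The weighted-sum fusion, the weights `w_geo` and `w_vlm`, and both function names are assumptions made for this example, not details taken from World-R1.

```python
from statistics import mean, pstdev


def fuse_rewards(geometry_scores, vlm_scores, w_geo=0.5, w_vlm=0.5):
    """Combine two per-sample reward signals with a weighted sum.

    Assumption: a plain weighted sum with hypothetical weights; the paper's
    actual fusion rule is not stated in the abstract.
    """
    if len(geometry_scores) != len(vlm_scores):
        raise ValueError("reward lists must have the same length")
    return [w_geo * g + w_vlm * v for g, v in zip(geometry_scores, vlm_scores)]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of samples.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their siblings get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical videos sampled for the same text prompt.
geometry = [0.2, 0.8, 0.5, 0.9]   # e.g. scores from a 3D foundation model
semantic = [0.6, 0.4, 0.7, 0.9]   # e.g. scores from a vision-language model
rewards = fuse_rewards(geometry, semantic)
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within the group, they sum to roughly zero, and only the relative ranking of samples for the same prompt drives the update. This is what lets reward models with different scales be used without a learned value function.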