🤖 AI Summary
This work addresses the challenge of geometric and motion inconsistency across views in multi-view video generation. To this end, we propose IC-World, the first framework to systematically explore video world models for shared world modeling. Methodologically, IC-World leverages the contextual generation capability of large-scale video foundation models: given multi-view static images as input, it concurrently synthesizes dynamic video sequences across all views. We further introduce a reinforcement learning mechanism based on group-relative policy optimization, coupled with a novel dual reward model that explicitly enforces scene-level geometric consistency and object-level motion consistency, for end-to-end optimization. Experiments demonstrate that IC-World significantly outperforms state-of-the-art methods on both geometric and motion consistency metrics, enabling high-fidelity, cross-view coherent dynamic content generation.
📝 Abstract
Video-based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling, where a model generates multiple videos from a set of input images, each representing the same underlying world from a different camera pose. We propose IC-World, a novel generation framework that enables parallel generation for all input images by activating the inherent in-context generation capability of large video models. We further finetune IC-World via reinforcement learning with Group Relative Policy Optimization (GRPO), together with two proposed novel reward models that enforce scene-level geometric consistency and object-level motion consistency across the set of generated videos. Extensive experiments demonstrate that IC-World substantially outperforms state-of-the-art methods in both geometric and motion consistency. To the best of our knowledge, this is the first work to systematically explore the shared world modeling problem with video-based world models.
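The abstract's training recipe, GRPO with a dual reward, can be sketched at a high level. The snippet below is a minimal, hypothetical illustration of the group-relative part of GRPO: two scalar reward signals (standing in for the scene-level geometry reward and the object-level motion reward, whose actual definitions are not given here) are combined per sampled group member, then normalized against the group's own mean and standard deviation to form advantages. The weights `w_geo` and `w_mot` and the function names are assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev

def grpo_advantages(geometry_rewards, motion_rewards, w_geo=0.5, w_mot=0.5):
    """Illustrative GRPO-style advantage computation with a dual reward.

    Each list holds one scalar score per sampled rollout in the group.
    The combined reward is normalized within the group, so the group
    itself serves as the baseline (no learned critic is needed).
    """
    # Weighted combination of the two reward signals (assumed scalars).
    combined = [w_geo * g + w_mot * m
                for g, m in zip(geometry_rewards, motion_rewards)]
    mu = mean(combined)
    sigma = pstdev(combined) or 1.0  # avoid division by zero
    # Group-relative advantage: how much each sample beats its peers.
    return [(r - mu) / sigma for r in combined]

# A group of four rollouts with toy geometry and motion scores:
adv = grpo_advantages([0.9, 0.2, 0.5, 0.4], [0.8, 0.1, 0.6, 0.5])
```

These advantages would then weight the policy-gradient update for each rollout, so that generations with above-average combined consistency are reinforced relative to their group.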