🤖 AI Summary
Existing embodied world models are often confined to 2D representations, which limits their capacity for spatial reasoning. Extending them to multiple views is hindered by the scarcity of paired multi-view data, spatiotemporal inconsistencies in generated 3D geometry, and hallucinated manipulation details. To address these issues, this work proposes Embody4D, a video-to-video 4D world model that synthesizes novel views from a monocular video. It uses a 3D-aware compositional synthesis pipeline to construct heterogeneous training data, introduces an adaptive noise injection strategy based on confidence differences across image regions to enforce spatiotemporal consistency, and designs an interaction-aware attention mechanism to improve the realism of manipulated regions. The approach generates high-fidelity, view-consistent videos of dynamic scenes, which the authors report as state-of-the-art and as useful for downstream robotic planning and learning.
📝 Abstract
World models have made significant progress in modeling dynamic environments; however, most embodied world models are still restricted to 2D representations, lacking the comprehensive multi-view information essential for embodied spatial reasoning. Bridging this gap is non-trivial, primarily due to the severe scarcity of paired multi-view data, the difficulty of maintaining spatiotemporal consistency in generated 3D geometries, and the tendency to hallucinate manipulation details. To address these challenges, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios, capable of synthesizing arbitrary novel views from a monocular video. First, to tackle data scarcity, we introduce a 3D-aware compositional synthesis pipeline to curate a heterogeneous dataset that composites cross-embodiment robotic arms with diverse backgrounds, promoting broad generalization. Second, to enforce geometric stability, we devise an adaptive noise injection strategy; by leveraging confidence disparities across image regions, this method selectively regularizes the diffusion process to improve spatiotemporal consistency. Finally, to preserve manipulation fidelity, we incorporate an interaction-aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments demonstrate that Embody4D achieves state-of-the-art performance, serving as a robust world model that synthesizes high-fidelity, view-consistent videos to empower downstream robotic planning and learning.
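The abstract does not specify how confidence disparities are turned into noise levels, so the following is only a minimal illustrative sketch of the general idea of confidence-dependent noise injection, not the Embody4D method. It assumes a per-pixel confidence in [0, 1] and a linear rule in which the noise standard deviation is `sigma_max * (1 - confidence)`. The function name `inject_adaptive_noise`, the parameter `sigma_max`, and the use of a plain 2D grid instead of diffusion latents are all hypothetical choices made for illustration.

```python
import random


def inject_adaptive_noise(frame, confidence, sigma_max=1.0, rng=None):
    """Add Gaussian noise whose scale shrinks as confidence grows.

    frame and confidence are 2D lists of floats with the same shape.
    The linear rule sigma = sigma_max * (1 - confidence) is an
    assumption for illustration, not the rule used in the paper.
    """
    if rng is None:
        rng = random.Random()
    noisy = []
    for row, conf_row in zip(frame, confidence):
        out_row = []
        for value, conf in zip(row, conf_row):
            conf = min(max(conf, 0.0), 1.0)   # clamp confidence to [0, 1]
            sigma = sigma_max * (1.0 - conf)  # low confidence -> more noise
            out_row.append(value + rng.gauss(0.0, sigma))
        noisy.append(out_row)
    return noisy


# Toy example: the left half is fully trusted, the right half is not.
frame = [[0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]
confidence = [[1.0, 1.0, 0.2, 0.2],
              [1.0, 1.0, 0.2, 0.2]]
noisy = inject_adaptive_noise(frame, confidence, sigma_max=0.5,
                              rng=random.Random(0))
```

In this toy setting, fully trusted pixels pass through unchanged while low-confidence pixels are perturbed, so a subsequent denoising step would be free to regenerate only the uncertain regions.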