🤖 AI Summary
This work addresses the limitations of existing robotic simulation methods, which are often confined to 2D or static environments and fail to capture the inherently 4D spatiotemporal interactions between robots and their surroundings. To overcome this, we propose Kinema4D, an action-conditioned generative 4D simulation framework that leverages URDF-based kinematics to produce precise 4D robot trajectories, which in turn guide the synchronized generation of geometrically consistent and physically plausible RGB and pointmap sequences. Our approach achieves, for the first time, generative simulation with full 4D spatiotemporal consistency, enabling zero-shot transfer to real-world scenarios, and it significantly outperforms current methods in dynamic realism, geometric fidelity, and embodied generalization. We also introduce Robo4D-200k, a large-scale, densely annotated 4D dataset to support future research in this domain.
📝 Abstract
Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles robot-world interaction into: i) Precise 4D representation of robot control: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, conditioning the generative model to synthesize the environment's complex reactive dynamics as synchronized RGB/pointmap sequences. To facilitate training, we curate a large-scale dataset, Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically plausible, geometrically consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential for zero-shot transfer, providing a high-fidelity foundation for advancing next-generation embodied simulation.
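To make the conditioning step concrete, the sketch below illustrates, under assumptions, how a 4D robot trajectory could be rasterized into per-frame pointmaps that serve as a spatiotemporal visual signal for a video generative model. It is not the paper's implementation: the per-frame robot surface points are assumed to come from URDF forward kinematics but are replaced here by a synthetic translating point cloud, the pinhole intrinsics `K` are made up, and the function name `project_to_pointmap` and all parameters are hypothetical.

```python
# Hypothetical sketch (not the paper's code): rasterize a 4D robot trajectory
# into per-frame pointmaps that can condition a video generative model.
# Assumes robot surface points per frame come from URDF forward kinematics
# (e.g., sampled from link meshes); here they are synthesized for illustration.
import numpy as np

def project_to_pointmap(points_xyz, K, H, W):
    """Rasterize 3D points (N, 3) in camera coordinates into an (H, W, 3)
    pointmap storing the XYZ of the nearest point per pixel (z-buffered)."""
    pointmap = np.zeros((H, W, 3), dtype=np.float32)
    zbuf = np.full((H, W), np.inf, dtype=np.float32)

    pts = points_xyz[points_xyz[:, 2] > 1e-6]   # keep points in front of camera
    uvw = (K @ pts.T).T                         # pinhole projection
    uv = (uvw[:, :2] / uvw[:, 2:3]).round().astype(int)

    for (u, v), p in zip(uv, pts):
        if 0 <= u < W and 0 <= v < H and p[2] < zbuf[v, u]:
            zbuf[v, u] = p[2]                   # nearest surface wins
            pointmap[v, u] = p
    return pointmap

# --- Toy 4D trajectory: a point cloud translating over T frames -------------
T, N, H, W = 8, 2000, 128, 128
K = np.array([[100.0, 0.0, W / 2],              # assumed pinhole intrinsics
              [0.0, 100.0, H / 2],
              [0.0, 0.0, 1.0]])
base = np.random.rand(N, 3) * [0.4, 0.4, 0.2] + [-0.2, -0.2, 1.0]

pointmap_seq = np.stack([
    project_to_pointmap(base + [0.02 * t, 0.0, 0.0], K, H, W)
    for t in range(T)
])                                              # (T, H, W, 3) conditioning signal
print(pointmap_seq.shape)
```

In this reading, each frame of the sequence is a dense image-aligned encoding of where the robot's geometry sits in 3D, so the generative model receives the control trajectory in the same spatial format as the RGB/pointmap outputs it is asked to synthesize.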