Inferring Compositional 4D Scenes without Ever Seeing One

📅 2025-12-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing methods rely on category-specific parametric shape models and handle only single static or single dynamic objects, making it challenging to consistently infer the 4D spatiotemporal structure of multi-object scenes and requiring costly 4D compositional supervision. This paper proposes COM4Dβ€”the first method capable of jointly reconstructing full 4D scenes using only static multi-object or dynamic single-object supervision, without any 4D compositional data. Our approach introduces an attention-based mixing mechanism that decouples spatial and temporal modeling: it learns object composition from 2D videos and alternates between spatial scene-structure inference and per-object dynamic modeling to achieve spatiotemporal co-reasoning. COM4D achieves state-of-the-art performance on both 4D object reconstruction and compositional 3D reconstruction benchmarks. Crucially, it is purely data-driven and generates coherent, persistent 4D scenes featuring realistic multi-object interactions.

πŸ“ Abstract
Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structure, composition, and spatio-temporal configuration in the wild, though extremely interesting, is equally hard. Existing works therefore often focus on one object at a time, relying on a category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single-object supervision. We achieve this by carefully training spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand and single-object dynamics throughout the video on the other, completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, COM4D achieves state-of-the-art results on the existing separate problems of 4D object and composed 3D reconstruction, despite being purely data-driven.
Problem

Research questions and friction points this paper is trying to address.

How to reconstruct 4D scenes from monocular videos without any 4D training data.
How to jointly predict the structure and spatio-temporal configuration of multiple interacting objects.
How to avoid reliance on category-specific parametric shape models for dynamic objects.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains spatial and temporal attentions on 2D video input.
Disentangles training to avoid any reliance on 4D compositional data.
Mixes the independently learned attentions at inference to reconstruct 4D scenes from monocular videos.
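The alternating spatial/temporal reasoning can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it uses plain softmax self-attention with identity projections, and the token layout (`T` frames × `N` object tokens) and function names are assumptions for the sake of the example. A spatial pass attends across object tokens within each frame; a temporal pass attends across frames for each token; the two are alternated.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product self-attention over the second-to-last axis
    scale = q.shape[-1] ** -0.5
    w = softmax(q @ np.swapaxes(k, -1, -2) * scale)
    return w @ v

def alternating_attention(tokens, n_blocks=2):
    # tokens: (T frames, N object tokens, D channels) -- layout is an assumption
    x = tokens
    for _ in range(n_blocks):
        # spatial pass: attend across the N object tokens within each frame
        x = x + attention(x, x, x)
        # temporal pass: attend across the T frames for each object token
        xt = np.swapaxes(x, 0, 1)        # (N, T, D)
        xt = xt + attention(xt, xt, xt)
        x = np.swapaxes(xt, 0, 1)        # back to (T, N, D)
    return x
```

In this sketch, "attention mixing" amounts to interleaving two attention patterns that could have been trained on different data (multi-object stills for the spatial pass, single-object videos for the temporal pass), since neither pass ever needs a 4D compositional example.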