Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering

📅 2024-07-30
🏛️ IEEE Transactions on Pattern Analysis and Machine Intelligence
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 2D unsupervised object-centric representation learning methods struggle to capture the intrinsic 3D geometry and motion of dynamic scenes. To address this, we propose DynaVol-S, the first framework to jointly optimize object-centric voxelization with a canonical-space deformation function, driven by differentiable volumetric rendering of compositional Neural Radiance Fields (NeRFs). This enables disentangled learning of geometry, semantics, and motion within a 3D voxel space. Crucially, DynaVol-S supports native 3D operations, including geometric editing and trajectory manipulation, that go beyond the inherent limitations of 2D approaches. Quantitatively, it achieves state-of-the-art performance on novel-view synthesis and unsupervised object decomposition, and it demonstrates strong robustness and generalization on real-world dynamic scenes with complex object interactions.
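The summary's core mechanism, per-object occupancy stored in a canonical voxel grid and queried through a deformation function, can be illustrated with a minimal sketch. All names (`query_canonical`, `deform`, grid shapes) are illustrative assumptions, not the paper's actual API, and nearest-neighbor lookup stands in for the trilinear interpolation used in practice.

```python
import numpy as np

def query_canonical(grid, x_t, deform):
    """Query per-object occupancy at a dynamic-scene point by warping
    it into the canonical voxel grid (illustrative sketch only).

    grid:   (R, R, R, K) canonical per-object occupancy logits
    x_t:    (3,) query point in [0, 1)^3 at the current time step
    deform: callable mapping x_t to canonical coordinates in [0, 1)^3
    """
    x_c = deform(x_t)  # canonical-space coordinates for this time step
    # Nearest-neighbor voxel lookup; a real system would interpolate trilinearly.
    idx = np.clip((x_c * grid.shape[0]).astype(int), 0, grid.shape[0] - 1)
    logits = grid[idx[0], idx[1], idx[2]]            # (K,) per-object logits
    # Softmax turns logits into per-object occupancy probabilities.
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

Because the grid lives in canonical space, motion is carried entirely by `deform`, which is what makes trajectory manipulation possible after training.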

📝 Abstract
Learning object-centric representations from unsupervised videos is challenging. Unlike most previous approaches that focus on decomposing 2D images, we present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers per-object occupancy probabilities at individual spatial locations. These voxel features evolve through a canonical-space deformation function and are optimized in an inverse rendering pipeline with a compositional NeRF. Additionally, our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids. DynaVol-S significantly outperforms existing models in both novel view synthesis and unsupervised decomposition tasks for dynamic scenes. By jointly considering geometric structures and semantic features, it effectively addresses challenging real-world scenarios involving complex object interactions. Furthermore, once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve, such as novel scene generation through editing geometric shapes or manipulating the motion trajectories of objects.
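The inverse-rendering pipeline the abstract describes composes per-object densities along each ray. The sketch below shows one way such compositional volume rendering can work; the function name, shapes, and the density-share segmentation rule are assumptions for illustration, not DynaVol-S's exact formulation.

```python
import numpy as np

def composite_ray(deltas, sigmas, colors):
    """Compositional volume rendering along one ray (illustrative sketch).

    deltas: (S,)       step size per sample
    sigmas: (S, K)     per-object densities at each sample
    colors: (S, K, 3)  per-object radiance at each sample
    Returns the rendered RGB and a soft per-object segmentation weight.
    """
    sigma_total = sigmas.sum(axis=1)                       # (S,) summed density
    alpha = 1.0 - np.exp(-sigma_total * deltas)            # opacity per sample
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], (1.0 - alpha)[:-1]]))
    weights = trans * alpha                                # (S,) sample weights
    # Each object's occupancy share of the total density at a sample.
    occ = sigmas / np.clip(sigma_total[:, None], 1e-10, None)   # (S, K)
    mixed_color = (occ[..., None] * colors).sum(axis=1)    # (S, 3)
    rgb = (weights[:, None] * mixed_color).sum(axis=0)     # (3,) pixel color
    seg = (weights[:, None] * occ).sum(axis=0)             # (K,) soft object mask
    return rgb, seg
```

The soft mask `seg` is what lets a photometric reconstruction loss drive unsupervised object decomposition: no segmentation labels enter the pipeline, yet each object's density field must explain its own part of the image.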
Problem

Research questions and friction points this paper is trying to address.

3D dynamic scene understanding
object-centric voxelization
unsupervised learning from videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D generative model DynaVol-S
Object-centric voxelization technique
Compositional NeRF rendering pipeline
Yanpeng Zhao
University of Edinburgh
Yiwei Hao
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China
Siyu Gao
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China
Yunbo Wang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China
Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China