AI Summary
This work addresses the challenges of geometry-motion coupling and sparse or constrained outputs in 4D reconstruction from monocular video by proposing 4RC, a unified feed-forward framework. 4RC introduces a novel "encode once, query anywhere in space-time" paradigm that decouples 4D attributes into a static base geometry and time-varying relative motion. Leveraging a Transformer-based spatio-temporal encoder and a conditional query decoder, the method learns dense 4D representations end to end. It supports high-fidelity querying of geometry and motion for arbitrary query frames at continuous target timestamps, achieving state-of-the-art performance across multiple 4D reconstruction benchmarks and significantly outperforming both existing and concurrent approaches.
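In equation form (the notation below is ours, not taken from the paper), this factorization reads

$$
P(x, t) = P_{\text{base}}(x) + \Delta P(x, t),
$$

where $P_{\text{base}}(x)$ is the static base geometry of a point $x$ in a given view and $\Delta P(x, t)$ is its time-varying relative motion, queryable at any continuous timestamp $t$.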
Abstract
We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches, which typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere-and-anytime paradigm: a Transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form, decomposing them into a static base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.
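The abstract does not include an implementation, but the encode-once, query-anywhere-and-anytime interface can be sketched as follows. This is a minimal illustration assuming a PyTorch-style model; the class name `FourRCSketch`, the heads `head_base`/`head_motion`, the layer counts, and the timestamp embedding are all our assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class FourRCSketch(nn.Module):
    """Illustrative sketch of encode-once, query-anywhere-and-anytime.

    Layer counts, widths, and the timestamp embedding are assumptions;
    they do not reproduce the authors' architecture.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        # Spatio-temporal encoder: run ONCE over the whole video clip.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Conditional decoder: cross-attends each query to the cached latents.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.time_embed = nn.Linear(1, dim)   # embeds a continuous timestamp
        self.head_base = nn.Linear(dim, 3)    # static base geometry (xyz)
        self.head_motion = nn.Linear(dim, 3)  # relative motion at time t

    def encode(self, video_tokens: torch.Tensor) -> torch.Tensor:
        """video_tokens: (B, N, dim) tokens for the entire clip."""
        return self.encoder(video_tokens)

    def query(self, latents: torch.Tensor, frame_tokens: torch.Tensor,
              t: torch.Tensor):
        """frame_tokens: (B, M, dim) tokens of the query frame; t: (B, 1)."""
        q = frame_tokens + self.time_embed(t).unsqueeze(1)  # condition on t
        feat = self.decoder(q, latents)
        base = self.head_base(feat)           # P_base(x)
        motion = self.head_motion(feat)       # delta P(x, t)
        return base, base + motion            # geometry at time t


# Usage: encode the clip once, then issue arbitrary space-time queries.
model = FourRCSketch(dim=256)
latents = model.encode(torch.randn(1, 512, 256))       # whole-video tokens
base, geo_t = model.query(latents, torch.randn(1, 128, 256),
                          t=torch.tensor([[0.3]]))     # any frame, any t
```

The design point the abstract emphasizes is amortization: the heavy encoder runs once per video, so each additional space-time query only pays the cost of the lightweight conditional decoder.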