🤖 AI Summary
Existing methods for editable representations of high-resolution dynamic scenes struggle to balance editability with accurate modeling of complex occlusions: neural atlases support 2D editing but suffer from multi-object occlusion ambiguities, while scene graph models capture 3D spatial relationships yet lack view-consistent appearance editing. To address this, we propose Neural Atlas Graphs (NAGs), in which each graph node is a view-dependent neural atlas - integrating the 2D editability of neural atlases with the 3D relational reasoning of graph structures - enabling high-fidelity, view-consistent editing without explicit annotations. Fit at test time, NAGs jointly optimize their deformable atlas layers for both reconstruction and editing. On the Waymo Open Dataset, the method achieves a 5 dB PSNR gain over the state of the art; on the DAVIS video editing benchmark, it improves PSNR by more than 7 dB. NAGs support high-resolution environmental modifications and the synthesis of photorealistic counterfactual driving scenes.
📝 Abstract
Learning editable high-resolution representations of dynamic scenes is an open problem with applications across domains from autonomous driving to creative editing. The most successful approaches today trade off editability against supported scene complexity: neural atlases represent dynamic scenes as two deforming image layers, foreground and background, which are editable in 2D but break down when multiple objects occlude and interact. In contrast, scene graph models use annotated data such as masks and bounding boxes from autonomous-driving datasets to capture complex 3D spatial relationships, but their implicit volumetric node representations are challenging to edit view-consistently. We propose Neural Atlas Graphs (NAGs), a hybrid high-resolution scene representation in which every graph node is a view-dependent neural atlas, facilitating both 2D appearance editing and 3D ordering and positioning of scene elements. Fit at test time, NAGs achieve state-of-the-art quantitative results on the Waymo Open Dataset - a 5 dB PSNR increase over existing methods - and enable environmental editing at high resolution and visual quality, creating counterfactual driving scenarios with new backgrounds and edited vehicle appearance. We find that the method also generalizes beyond driving scenes, comparing favorably - by more than 7 dB in PSNR - to recent matting and video editing baselines on the DAVIS video dataset with a diverse set of human- and animal-centric scenes.