🤖 AI Summary
This work addresses the challenge of achieving high-quality, conditionally controllable scene-level 3D generation while unifying generative and reconstruction capabilities. The authors propose a novel approach that, for the first time, aligns a pre-trained video diffusion model with a foundational 3D reconstruction model in latent space. By decoupling yet jointly optimizing appearance and geometry latent variables, the method simultaneously generates RGB videos and their corresponding 3D structures, including camera poses, depth maps, and global point clouds. Specifically, an adapter trained on the tokens of the VGGT reconstruction model produces geometry-aware latent codes, which are regularized to align with the appearance latents of the diffusion model. The approach achieves state-of-the-art performance in both single-image and multi-view conditional 3D scene generation and significantly improves reconstruction robustness.
📝 Abstract
We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
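As a rough, non-authoritative illustration of the central idea, an adapter that maps a frozen reconstruction model's tokens into a frozen diffusion model's latent space, trained with an alignment regularizer, could be sketched as below. All names, shapes, the linear adapter, and the plain MSE loss are hypothetical simplifications, not the paper's implementation:

```python
import numpy as np

# Hypothetical sketch: only the adapter weights W are trained; the geometry
# tokens (reconstruction model) and appearance latents (diffusion model) stand
# in for outputs of frozen pre-trained networks. Shapes are made up.
rng = np.random.default_rng(0)
d_tok, d_lat, n = 64, 32, 128

tokens = rng.standard_normal((n, d_tok))      # stand-in for VGGT-style geometry tokens
appearance = rng.standard_normal((n, d_lat))  # stand-in for diffusion appearance latents

W = rng.standard_normal((d_tok, d_lat)) * 0.01  # adapter weights (the only trainable part here)

def align_loss(W):
    geo = tokens @ W  # adapter output: geometry-aware latents
    return float(np.mean((geo - appearance) ** 2))  # regularize toward appearance latents

loss_before = align_loss(W)

# Plain gradient descent on the MSE alignment loss
lr = 0.1
for _ in range(200):
    grad = 2.0 * tokens.T @ (tokens @ W - appearance) / n
    W -= lr * grad

loss_after = align_loss(W)
print(loss_before, loss_after)
```

In the paper the aligned geometry and appearance latents are then generated jointly, so that one sampling pass yields both the RGB video and its 3D structure; the sketch above only shows the alignment step in isolation.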