🤖 AI Summary
This work aims to improve neural novel view synthesis along three axes: geometric consistency, real-time performance, and generalization, without relying on explicit 3D reconstruction. To this end, we propose a 3D-aware, implicit-feature-driven encoder-decoder architecture: the encoder is initialized from a pretrained 3D reconstruction network to inject strong geometric priors, while a lightweight decoder is trained end-to-end with a photometric loss. The method keeps the pipeline fully neural yet achieves geometrically plausible, real-time rendering, and extends naturally to diffusion models for generative view extrapolation. It attains state-of-the-art performance among deterministic feed-forward methods (31.4 PSNR on RealEstate10K), supports high-quality synthesis with both known and unknown camera poses, and generalizes well to in-the-wild scenes.
📝 Abstract
Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue that strong 3D inductive biases are still helpful in the design of such networks. We demonstrate this by introducing LagerNVS, an encoder-decoder neural network for NVS that builds on "3D-aware" latent features. The encoder is initialized from a 3D reconstruction network pre-trained with explicit 3D supervision; it is paired with a lightweight decoder, and the whole model is trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis (including 31.4 PSNR on Re10k) with and without known cameras, renders in real time, generalizes to in-the-wild data, and can be paired with a diffusion decoder for generative extrapolation.
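The training recipe described above (a frozen, geometrically pretrained encoder feeding a small decoder trained only with a photometric loss) can be sketched in a toy form. This is a minimal illustration with linear stand-ins, not the paper's actual LagerNVS architecture: `W_enc` plays the role of the pretrained 3D-aware encoder and stays fixed, while only the lightweight decoder `W_dec` receives gradient updates from a photometric (MSE) objective on synthetic source/target view pairs.

```python
# Toy sketch of the encoder-decoder split (assumption: linear stand-ins,
# synthetic data; the real LagerNVS uses deep networks and real images).
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_LAT, D_OUT = 32, 16, 32   # source-view, latent, and target-view dims

# Frozen encoder: stand-in for a network pretrained with explicit 3D supervision.
W_enc = rng.standard_normal((D_LAT, D_IN)) / np.sqrt(D_IN)

# Lightweight decoder: the only trainable component in this sketch.
W_dec = rng.standard_normal((D_OUT, D_LAT)) / np.sqrt(D_LAT)

# Synthetic "scene": target views y are a fixed linear map of source views x.
W_true = rng.standard_normal((D_OUT, D_IN)) / np.sqrt(D_IN)
x = rng.standard_normal((256, D_IN))
y = x @ W_true.T

def photometric_loss(pred, target):
    """Mean-squared photometric error between rendered and ground-truth views."""
    return float(np.mean((pred - target) ** 2))

losses, lr = [], 2.0
for step in range(200):
    z = x @ W_enc.T                # 3D-aware latent features (encoder frozen)
    pred = z @ W_dec.T             # decoded novel view
    losses.append(photometric_loss(pred, y))
    # Gradient of the photometric loss w.r.t. the decoder only.
    grad = 2.0 / pred.size * (pred - y).T @ z
    W_dec -= lr * grad

print(f"photometric loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The decoder cannot drive the loss to zero here because the frozen latent is lower-dimensional than the views, which mirrors the design choice: the encoder's geometric prior constrains what the decoder can render, and only the rendering head is fit to the photometric objective.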