🤖 AI Summary
This work addresses a key limitation of existing feed-forward view synthesis methods: their reliance on Plücker ray representations, which are highly sensitive to the choice of camera coordinate system and therefore degrade cross-view geometric consistency. To overcome this, the authors propose a projective-conditioning strategy that replaces raw ray inputs with 2D projective cues from the target view, reformulating the task as a stable image-to-image translation problem. A masked autoencoding pretraining strategy tailored to this conditioning is also introduced, enabling the use of large-scale uncalibrated data. The approach improves robustness and view consistency, achieving state-of-the-art performance on multiple novel view synthesis benchmarks and outperforming ray-based baselines by a clear margin on geometric-consistency metrics, demonstrating the benefit of decoupling the conditioning signal from explicit ray parameterization.
📝 Abstract
Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.
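The abstract's central claim, that Plücker ray maps tie the conditioning input to the arbitrary world coordinate gauge, can be made concrete with a small sketch. The paper does not publish this code; the function below is the standard construction of a per-pixel Plücker map (unit ray direction plus moment vector) from an assumed pinhole intrinsics matrix `K` and world-to-camera extrinsics `(R, t)`. The final check illustrates the gauge sensitivity: a global re-rotation of the world frame changes no relative camera pose or scene geometry, yet rewrites every entry of the ray map.

```python
import numpy as np

def plucker_ray_map(K, R, t, H, W):
    """Per-pixel Plücker coordinates (d, o × d) for a pinhole camera.

    K: 3x3 intrinsics; R, t: world-to-camera extrinsics (x_cam = R x_world + t).
    Returns an (H, W, 6) map of unit ray directions and moment vectors,
    both expressed in the WORLD frame — hence the gauge dependence.
    """
    o = -R.T @ t                                   # camera center in world coords
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs)], axis=-1)  # homogeneous pixel centers
    d = pix @ np.linalg.inv(K).T @ R               # = R^T K^{-1} p per pixel: world-frame direction
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    m = np.cross(np.broadcast_to(o, d.shape), d)   # moment vector o × d
    return np.concatenate([d, m], axis=-1)

# Re-gauging the world frame by a rotation G (x_world = G x_world') leaves the
# scene and all relative poses unchanged but alters the Plücker conditioning:
K = np.array([[100., 0., 32.], [0., 100., 32.], [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)
theta = 0.3
G = np.array([[np.cos(theta), -np.sin(theta), 0.],
              [np.sin(theta),  np.cos(theta), 0.],
              [0., 0., 1.]])
original = plucker_ray_map(K, R, t, 4, 4)
regauged = plucker_ray_map(K, R @ G, t, 4, 4)
print(np.allclose(original, regauged))  # False: same scene, different input
```

This is exactly the brittleness the abstract targets: a ray-conditioned network must learn invariance to such re-gaugings from data, whereas a target-view projective cue in image space sidesteps the problem.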