🤖 AI Summary
While existing view synthesis Transformers demonstrate strong performance, their scaling behavior with compute remains poorly understood. This work systematically investigates their scaling laws and proposes design principles for compute-optimal novel view synthesis models. Through a Transformer-based encoder-decoder architecture, the Scalable View Synthesis Model (SVSM), a comprehensive scaling analysis, and fair training comparisons under matched computational budgets, we demonstrate that a well-designed encoder-decoder structure can be compute-optimal, challenging the prevailing assumption of its inefficiency. Experiments show that SVSM consistently outperforms prior state-of-the-art methods across multiple compute scales, establishing a superior performance-compute Pareto frontier on real-world novel view synthesis benchmarks while significantly reducing training costs.
📝 Abstract
Geometry-free view synthesis transformers have recently achieved state-of-the-art performance in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. Yet the factors governing their scaling with compute remain unclear. We present a systematic study of scaling laws for view synthesis transformers and derive design principles for training compute-optimal NVS models. Contrary to prior findings, we show that encoder-decoder architectures can be compute-optimal; we trace earlier negative results to suboptimal architectural choices and comparisons across unequal training compute budgets. Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.