🤖 AI Summary
To address novel view synthesis under sparse-view settings, this work proposes two purely data-driven Transformer architectures devoid of any explicit 3D inductive bias: an encoder-decoder LVSM that learns compact 1D scene representations, and a decoder-only LVSM that maps input views to novel views end to end. Crucially, the approach abandons all explicit 3D representations (e.g., NeRF, 3D Gaussian Splatting) and geometry-aware design constraints (e.g., epipolar projection, plane sweeping). Instead, it relies on image tokenization, 1D latent compression, and full-attention cross-view modeling to achieve zero-shot generalization for novel view synthesis, demonstrated for the first time in this setting. On multiple benchmarks, the method surpasses state-of-the-art baselines by 1.5–3.5 dB in PSNR, while training and efficient inference require only 1–2 GPUs, striking a strong balance between speed and reconstruction quality.
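The image tokenization mentioned above can be sketched minimally: each input view is split into non-overlapping patches that are flattened into 1D tokens before being fed to the transformer. This is an illustrative ViT-style sketch, not the paper's actual code; the function name, patch size, and image shape are assumptions.

```python
import numpy as np

def patchify(image, patch_size=8):
    """Split an H x W x C image into non-overlapping patches and
    flatten each patch into a 1D token (illustrative sketch only;
    the real model would also apply a learned linear projection)."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    # Reorder so each patch's pixels are contiguous, then flatten per patch.
    tokens = patches.transpose(0, 2, 1, 3, 4).reshape(ph * pw, patch_size * patch_size * C)
    return tokens  # shape: (num_tokens, token_dim)

tokens = patchify(np.zeros((64, 64, 3)), patch_size=8)
print(tokens.shape)  # (64, 192): 8x8 patches, each 8*8*3 values
```

In the paper's pipeline, camera pose information (e.g., per-pixel ray encodings) would also be folded into these tokens; the sketch shows only the pixel-to-token reshaping.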
📝 Abstract
We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods -- from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps) -- addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs). Please see our website for more details: https://haian-jin.github.io/projects/LVSM/ .
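The decoder-only variant's core idea, replacing epipolar projections and plane sweeps with full attention, can be illustrated with a single attention step: target-view query tokens attend to every input-view token with no geometric masking. This is a hedged, single-head numpy sketch under assumed shapes, not the LVSM architecture itself (which stacks full transformer blocks with learned projections).

```python
import numpy as np

def full_attention(queries, keys_values):
    """Single-head scaled dot-product attention in which every query
    token attends to every input token (no epipolar or plane-sweep
    structure imposed). Keys and values share one matrix for brevity."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    # Numerically stable softmax over all input tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

# Hypothetical shapes: 128 input-view tokens, 64 target-view query tokens, dim 32.
rng = np.random.default_rng(0)
input_tokens = rng.normal(size=(128, 32))  # tokens from the posed input images
query_tokens = rng.normal(size=(64, 32))   # target-view ray-encoding tokens
out = full_attention(query_tokens, input_tokens)
print(out.shape)  # (64, 32): one updated token per target-view query
```

The encoder-decoder variant differs only in where this attention is applied: a fixed set of learned 1D latent tokens first attends to the input tokens (compressing the scene), and the target queries then decode from those latents instead of from the raw inputs.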