🤖 AI Summary
Novel view synthesis (NVS) under sparse multi-view inputs faces a fundamental trade-off between rendering quality and efficiency: deterministic methods are fast but produce blurry outputs in occluded regions, while diffusion models yield plausible results at prohibitive computational cost. This paper introduces the first end-to-end hybrid framework that jointly integrates deterministic regression and masked autoregressive diffusion—without relying on hand-crafted 3D priors. Key innovations include: (1) a bidirectional Transformer that jointly encodes image tokens and Plücker ray features; (2) a dual-head decoder—where the deterministic head renders geometrically well-constrained regions and the diffusion head hallucinates occluded or unseen content; and (3) joint optimization via photometric and diffusion losses. Our method achieves state-of-the-art image quality across diverse scenes while accelerating inference by 10× over full-diffusion baselines.
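The routing idea behind the dual-head decoder can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's implementation: the shared latents, the linear "regression head", the random visibility mask, and the placeholder diffusion samples are all stand-ins for components the summary only names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: N tokens with D-dim shared latents from the transformer.
N, D, P = 16, 8, 3            # tokens, latent dim, output channels per token
z = rng.normal(size=(N, D))   # stand-in for the shared latent representation

# (1) Deterministic regression head: a linear projection stands in for the
#     feed-forward head that renders geometrically well-constrained pixels.
W_reg = rng.normal(size=(D, P))
pred_reg = z @ W_reg

# (2) Masked autoregressive diffusion head (placeholder): tokens flagged as
#     occluded/unseen would instead be sampled generatively.
occluded = rng.random(N) < 0.3          # hypothetical visibility mask
pred_diff = rng.normal(size=(N, P))     # placeholder for diffusion samples

# Compose per token: regression output where geometry is constrained,
# hallucinated content where it is not.
out = np.where(occluded[:, None], pred_diff, pred_reg)
```

The key point the sketch captures is that the two heads act on the *same* latents and are merged by a per-token mask, so the expensive generative path runs only where it is needed.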
📝 Abstract
Novel view synthesis (NVS) seeks to render photorealistic, 3D-consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas stochastic diffusion-based methods hallucinate plausible content yet incur heavy training- and inference-time costs. In this paper, we propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi-view image tokens and Plücker-ray embeddings, producing a shared latent representation. Two lightweight heads then act on this representation: (i) a feed-forward regression head that renders pixels where geometry is well constrained, and (ii) a masked autoregressive diffusion head that completes occluded or unseen regions. The entire model is trained end-to-end with joint photometric and diffusion losses, without handcrafted 3D inductive biases, enabling scalability across diverse scenes. Experiments demonstrate that our method attains state-of-the-art image quality while reducing rendering time by an order of magnitude compared with fully generative baselines.
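The Plücker-ray embeddings mentioned above have a standard closed form: a ray with origin `o` and unit direction `d` is represented by the 6-vector `(d, o × d)`. A minimal sketch of computing them per pixel for a pinhole camera, assuming world-to-camera extrinsics `(R, t)` and intrinsics `K` (the function name and conventions are illustrative, not from the paper):

```python
import numpy as np

def plucker_ray_embedding(K, R, t, H, W):
    """Per-pixel Plücker ray embeddings (d, o x d) for a pinhole camera.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    Returns an (H, W, 6) array of [direction | moment] per pixel.
    """
    # Camera center in world coordinates: o = -R^T t
    o = -R.T @ t
    # Homogeneous pixel coordinates at pixel centers
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3)
    # Back-project to world-space ray directions and normalize:
    # d_world = R^T K^{-1} p, written here with row vectors.
    d = pix @ np.linalg.inv(K).T @ R                   # (H, W, 3)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Plücker moment: m = o x d (origin-independent along the ray)
    m = np.cross(np.broadcast_to(o, d.shape), d)
    return np.concatenate([d, m], axis=-1)             # (H, W, 6)
```

Because the moment `o × d` is invariant to sliding the origin along the ray, these 6-vectors give the transformer a pose encoding that identifies each pixel's viewing ray without any explicit 3D scene structure.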