🤖 AI Summary
Novel view synthesis (NVS) under sparse multi-view inputs faces a fundamental trade-off between rendering quality and efficiency: deterministic methods are fast but produce blurry outputs in occluded regions, while diffusion models yield plausible results at prohibitive computational cost. This paper introduces the first end-to-end hybrid framework that jointly integrates deterministic regression and masked autoregressive diffusion—without relying on hand-crafted 3D priors. Key innovations include: (1) a bidirectional Transformer that jointly encodes image tokens and Plücker ray features; (2) a dual-head decoder—where the deterministic head renders geometrically well-constrained regions and the diffusion head hallucinates occluded or unseen content; and (3) joint optimization via photometric and diffusion losses. Our method achieves state-of-the-art image quality across diverse scenes while accelerating inference by 10× over full-diffusion baselines.
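The routing idea behind the dual-head decoder can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's implementation: the shared latents, the linear "regression head", the random visibility mask, and the placeholder diffusion samples are all stand-ins for components the summary only names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: N tokens with D-dim shared latents from the transformer.
N, D, P = 16, 8, 3            # tokens, latent dim, output channels per token
z = rng.normal(size=(N, D))   # stand-in for the shared latent representation

# (1) Deterministic regression head: a linear projection stands in for the
#     feed-forward head that renders geometrically well-constrained pixels.
W_reg = rng.normal(size=(D, P))
pred_reg = z @ W_reg

# (2) Masked autoregressive diffusion head (placeholder): tokens flagged as
#     occluded/unseen would instead be sampled generatively.
occluded = rng.random(N) < 0.3          # hypothetical visibility mask
pred_diff = rng.normal(size=(N, P))     # placeholder for diffusion samples

# Compose per token: regression output where geometry is constrained,
# hallucinated content where it is not.
out = np.where(occluded[:, None], pred_diff, pred_reg)
```

The key point the sketch captures is that the two heads act on the *same* latents and are merged by a per-token mask, so the expensive generative path runs only where it is needed.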
📝 Abstract
Novel view synthesis (NVS) seeks to render photorealistic, 3D-consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas stochastic diffusion-based methods hallucinate plausible content yet incur heavy training- and inference-time costs. In this paper, we propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi-view image tokens and Plücker-ray embeddings, producing a shared latent representation. Two lightweight heads then act on this representation: (i) a feed-forward regression head that renders pixels where geometry is well constrained, and (ii) a masked autoregressive diffusion head that completes occluded or unseen regions. The entire model is trained end-to-end with joint photometric and diffusion losses, without handcrafted 3D inductive biases, enabling scalability across diverse scenes. Experiments demonstrate that our method attains state-of-the-art image quality while reducing rendering time by an order of magnitude compared with fully generative baselines.
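The Plücker-ray embeddings mentioned above have a standard closed form: a ray with origin `o` and unit direction `d` is represented by the 6-vector `(d, o × d)`. A minimal sketch of computing them per pixel for a pinhole camera, assuming world-to-camera extrinsics `(R, t)` and intrinsics `K` (the function name and conventions are illustrative, not from the paper):

```python
import numpy as np

def plucker_ray_embedding(K, R, t, H, W):
    """Per-pixel Plücker ray embeddings (d, o x d) for a pinhole camera.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    Returns an (H, W, 6) array of [direction | moment] per pixel.
    """
    # Camera center in world coordinates: o = -R^T t
    o = -R.T @ t
    # Homogeneous pixel coordinates at pixel centers
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3)
    # Back-project to world-space ray directions and normalize:
    # d_world = R^T K^{-1} p, written here with row vectors.
    d = pix @ np.linalg.inv(K).T @ R                   # (H, W, 3)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Plücker moment: m = o x d (origin-independent along the ray)
    m = np.cross(np.broadcast_to(o, d.shape), d)
    return np.concatenate([d, m], axis=-1)             # (H, W, 6)
```

Because the moment `o × d` is invariant to sliding the origin along the ray, these 6-vectors give the transformer a pose encoding that identifies each pixel's viewing ray without any explicit 3D scene structure.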