CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis

📅 2025-09-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Most existing multi-view diffusion models are non-autoregressive: they support only a fixed number of views and are slow at inference because all frames are denoised simultaneously, which limits their use in world modeling. This paper proposes CausNVS, an autoregressive multi-view diffusion model that supports arbitrary input-output view configurations and generates views sequentially. It is trained with causal masking and per-frame noise, and uses pairwise-relative camera pose encodings (CaPE) for camera control. At inference time, a spatially aware sliding window is combined with key-value caching and noise conditioning augmentation to mitigate drift. Experiments show that CausNVS handles a broad range of camera trajectories and achieves consistently strong visual quality across diverse settings.

📝 Abstract

Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling, as they only support a fixed number of views and suffer from slow inference due to denoising all frames simultaneously. To address these limitations, we propose CausNVS, a multi-view diffusion model in an autoregressive setting, which supports arbitrary input-output view configurations and generates views sequentially. We train CausNVS with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, we combine a spatially-aware sliding-window with key-value caching and noise conditioning augmentation to mitigate drift. Our experiments demonstrate that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across diverse settings. Project page: https://kxhit.github.io/CausNVS.html.
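The training recipe named in the abstract (causal masking plus per-frame noise) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `frame_causal_mask` and `add_per_frame_noise`, the uniform noise-level sampling, and the linear mixing of latent and noise are all assumptions chosen for illustration.

```python
import numpy as np

def frame_causal_mask(num_frames, tokens_per_frame):
    """Return a boolean (N, N) mask with N = num_frames * tokens_per_frame.

    Entry [q, k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same frame as q or to an earlier frame.
    """
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_ids[:, None] >= frame_ids[None, :]

def add_per_frame_noise(latents, rng):
    """Corrupt each frame with its own independently sampled noise level.

    latents: array of shape (num_frames, ...).
    Assumption: noise levels are uniform in [0, 1] and the corruption is a
    linear mix of latent and Gaussian noise; the paper may use a different
    schedule and parameterization.
    """
    num_frames = latents.shape[0]
    sigma = rng.uniform(0.0, 1.0, size=num_frames)
    # Reshape so that each frame's level broadcasts over its own values.
    s = sigma.reshape((num_frames,) + (1,) * (latents.ndim - 1))
    noise = rng.standard_normal(latents.shape)
    noisy = (1.0 - s) * latents + s * noise
    return noisy, sigma

if __name__ == "__main__":
    mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
    print(mask.astype(int))
    rng = np.random.default_rng(0)
    noisy, sigma = add_per_frame_noise(np.zeros((3, 4, 8, 8)), rng)
    print(noisy.shape, sigma.shape)
```

Because every frame carries its own noise level, earlier frames can be treated as clean (or lightly noised) context while a later frame is being denoised, which is what makes sequential generation possible under the frame-level causal mask.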

Problem

Research questions and friction points this paper is trying to address.

Non-autoregressive multi-view diffusion models support only a fixed number of views

Denoising all frames simultaneously makes inference slow

Sequential (autoregressive) generation is prone to drift

Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive multi-view diffusion for flexible synthesis

Causal masking with pairwise-relative camera pose encodings

Spatially-aware sliding-window with key-value caching
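A rough sketch of how pairwise-relative poses and a spatially aware window over cached frames might be computed is given below. The selection criterion is an assumption: this summary does not state how CausNVS picks frames, so the sketch simply keeps the cached frames whose camera centers are closest to the target camera. The helper names `relative_pose` and `select_spatial_window` are hypothetical, poses are assumed to be 4x4 camera-to-world matrices, and the actual CaPE encoding is not reproduced here.

```python
import numpy as np

def relative_pose(pose_i, pose_j):
    """Pose of camera j expressed in camera i's coordinate frame.

    Both inputs are assumed to be 4x4 camera-to-world matrices.
    """
    return np.linalg.inv(pose_i) @ pose_j

def select_spatial_window(cached_poses, target_pose, window_size):
    """Pick up to window_size cached frames nearest to the target camera.

    cached_poses: array of shape (num_cached, 4, 4).
    Returns indices in ascending order, so the selected frames keep
    their original generation order.
    Assumption: proximity is Euclidean distance between camera centers.
    """
    centers = cached_poses[:, :3, 3]
    dists = np.linalg.norm(centers - target_pose[:3, 3], axis=1)
    nearest = np.argsort(dists, kind="stable")[:window_size]
    return np.sort(nearest)

def translation_pose(x):
    """Hypothetical helper: identity rotation, camera center at (x, 0, 0)."""
    pose = np.eye(4)
    pose[0, 3] = x
    return pose

if __name__ == "__main__":
    cached = np.stack([translation_pose(x) for x in (0.0, 1.0, 2.0, 3.0, 10.0)])
    target = translation_pose(2.2)
    idx = select_spatial_window(cached, target, window_size=2)
    print(idx)
    # Relative poses between the target and each selected frame.
    print([relative_pose(target, cached[i])[0, 3] for i in idx])
```

In a key-value caching setup, the returned indices would determine which cached frames' keys and values the next view attends to, so the context stays bounded while remaining spatially relevant to the target camera.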

🔎 Similar Papers

ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis