BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses detail inconsistency and artifacts in novel view synthesis from extremely sparse, unconstrained real-world captures. It builds on pretrained Stable Video Diffusion (SVD) as the backbone, taking features rendered via 3D Gaussian Splatting as input. Crucially, it introduces, within SVD's variational autoencoder (VAE) module, a joint regularization that combines temporal equivariance with alignment to representations from vision foundation models. This design departs from prior paradigms that fine-tune only the UNet, significantly improving temporal consistency and detail fidelity in generated views. Experiments on the DL3DV-10K dataset show that the method substantially outperforms current state-of-the-art approaches, producing higher-quality, more coherent, artifact-free novel views.

📝 Abstract
We present BetterScene, an approach that enhances novel view synthesis (NVS) quality for diverse real-world scenes captured with extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model, pretrained on billions of frames, as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Prior work has developed similar diffusion-based solutions to these challenges. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when geometry-aware regularizations such as depth or semantic conditions are incorporated. To address this, we investigate the latent space of the diffusion model and introduce two components, both applied to the variational autoencoder (VAE) module within the SVD pipeline: (1) temporal equivariance regularization and (2) a vision foundation model-aligned representation. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer, generating continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.
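To make the two VAE-side regularizers concrete, here is a minimal NumPy sketch of how they could be formulated. Everything here is a hypothetical stand-in: the linear "encoder" with causal temporal mixing, the projection head, the circular frame shift, and the loss weight are illustrative choices, not the paper's implementation (which operates on SVD's video VAE with frozen vision-foundation-model features).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a tiny "encoder" and a projection head mapping
# VAE latents into the vision-foundation-model (VFM) feature space.
W_enc = rng.normal(size=(8, 16)) * 0.1   # 16-dim flattened frame -> 8-dim latent
W_proj = rng.normal(size=(4, 8)) * 0.1   # 8-dim latent -> 4-dim VFM space

def encode(frames):
    """Toy encoder E(x) with simple causal temporal mixing; the paper
    uses SVD's video VAE encoder instead."""
    z = frames @ W_enc.T
    prev = np.vstack([np.zeros((1, z.shape[1])), z[:-1]])
    return z + 0.3 * prev

def temporal_equivariance_loss(frames):
    """|| E(shift_t(x)) - shift_t(E(x)) ||^2: encoding a temporally
    shifted clip should match shifting the encoded latents."""
    shifted = np.roll(frames, 1, axis=0)            # shift in pixel space
    z_of_shifted = encode(shifted)                  # E(shift(x))
    shifted_z = np.roll(encode(frames), 1, axis=0)  # shift(E(x))
    return float(np.mean((z_of_shifted - shifted_z) ** 2))

def alignment_loss(frames, vfm_feats):
    """1 - cosine similarity between projected latents and frozen VFM
    features, pulling the latent space toward the foundation model."""
    proj = encode(frames) @ W_proj.T
    num = np.sum(proj * vfm_feats, axis=1)
    den = np.linalg.norm(proj, axis=1) * np.linalg.norm(vfm_feats, axis=1) + 1e-8
    return float(np.mean(1.0 - num / den))

frames = rng.normal(size=(5, 16))    # 5 video frames, flattened
vfm_feats = rng.normal(size=(5, 4))  # frozen VFM features per frame

l_eq = temporal_equivariance_loss(frames)
l_al = alignment_loss(frames, vfm_feats)
total = l_eq + 0.5 * l_al            # joint regularization; weight is illustrative
```

In training, `total` would be added to the VAE's reconstruction objective, so the latent space is shaped by both losses jointly rather than by fine-tuning the UNet alone.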
Problem

Research questions and friction points this paper is trying to address.

Novel View Synthesis
Sparse Input
View Consistency
Artifact Reduction
3D Scene Synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Equivariance Regularization
Vision Foundation Model-Aligned Representation
3D Gaussian Splatting
Novel View Synthesis
Stable Video Diffusion