UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses robust 3D scene reconstruction from inconsistent multi-view images, i.e., views degraded by artifacts such as occlusions, motion blur, or low resolution. Rather than modeling per-view degradations inside the neural 3D representation, the authors decouple the task into two subtasks, restoration and reconstruction, which simplifies optimization. Their framework, UniVerse, first converts the inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into view-consistent images, and finally reconstructs the 3D scene from the restored views. Because the diffusion model learns a general scene prior from large-scale data, the approach generalizes to diverse image inconsistencies and can additionally control the style of the reconstructed scene. Experiments on synthetic and real-world datasets show strong generalization and superior performance over prior robust-reconstruction methods.

📝 Abstract
This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. Project page: https://jin-cao-tma.github.io/UniVerse.github.io/
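The decoupled pipeline described in the abstract (convert inconsistent views to an initial video, restore it with a video diffusion model, then reconstruct the 3D scene) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the per-view dictionaries, and the `camera_index` ordering key are all hypothetical stand-ins.

```python
# Hypothetical sketch of UniVerse's decoupled two-stage design.
# All names here are illustrative assumptions, not the authors' API.

def universe_reconstruct(views, restore_video, reconstruct_scene):
    """Restore inconsistent multi-view images, then reconstruct a 3D scene.

    views:             list of per-view records (image + camera ordering info)
    restore_video:     stand-in for the video diffusion model, mapping an
                       initial video (ordered frames) to consistent images
    reconstruct_scene: stand-in for any 3D reconstruction backend fit to
                       the restored, now-consistent images
    """
    # Stage 0: arrange the inconsistent views into an initial video
    # by ordering them along the camera trajectory
    initial_video = sorted(views, key=lambda v: v["camera_index"])

    # Stage 1 (restoration): the video diffusion model restores the
    # degraded frames into a view-consistent image sequence
    restored_images = restore_video(initial_video)

    # Stage 2 (reconstruction): fit the 3D scene representation
    # to the restored images
    return reconstruct_scene(restored_images)
```

The point of the sketch is the separation of concerns: the diffusion model handles all per-view inconsistency before any 3D optimization starts, so the reconstruction stage only ever sees a clean, consistent image set.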
Problem

Research questions and friction points this paper is trying to address.

Reconstructing a 3D scene from a set of inconsistent multi-view images
Prior methods that fold degradation modeling into the 3D representation rely heavily on dense observations
How to restore cross-view consistency before reconstruction, simplifying the optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples robust reconstruction into restoration and reconstruction
Uses video diffusion model to restore inconsistent image sequences
Learns general scene prior from large-scale data for diverse inconsistencies
👥 Authors
Jin Cao, State Key Lab of CAD&CG, Zhejiang University
Hongrui Wu, Tongji University
Ziyong Feng, DeepGlint
Hujun Bao, State Key Lab of CAD&CG, Zhejiang University
Xiaowei Zhou, Professor of Computer Science, Zhejiang University (Computer Vision, Computer Graphics)
Sida Peng, Zhejiang University (Computer Vision, Computer Graphics)