SViM3D: Stable Video Material Diffusion for Single Image 3D Generation

📅 2025-10-09
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This work addresses the challenging problem of reconstructing multi-view consistent, relightable PBR materials and geometry from a single input image. The authors propose the first latent video diffusion framework for single-image PBR reconstruction, jointly predicting spatially varying PBR parameters (albedo, roughness, metallic) and normal maps. The method incorporates explicit camera pose conditioning, a multi-view consistency loss, material-aware knowledge distillation, and a progressive denoising architecture. Key contributions include: (1) the first application of video diffusion models to jointly model material and geometry while enforcing multi-view consistency; and (2) a neural-prior mechanism that enables stable training without paired 3D supervision and supports controllable appearance editing. Extensive experiments demonstrate state-of-the-art relighting and novel-view synthesis performance across multiple object-centric datasets. The framework enables high-fidelity, editable 3D asset generation for applications in AR/VR, film, and real-time graphics.

📝 Abstract
We present Stable Video Materials 3D (SViM3D), a framework that predicts multi-view consistent physically based rendering (PBR) materials from a single image. Recently, video diffusion models have been used to efficiently reconstruct 3D objects from a single image. However, reflectance is still represented by simple material models, or must be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view, based on explicit camera control. This unique setup allows relighting and generating a 3D asset using our model as a neural prior. We introduce various mechanisms into this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games, and other visual media.
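
To make the setup concrete, here is a minimal, hypothetical sketch of the core idea the abstract describes: a denoiser operating on a stacked latent that holds appearance and PBR channels for all views jointly, conditioned on per-view camera poses. The module name, channel layout, and pose encoding below are our illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class JointPBRDenoiser(nn.Module):
    """Toy stand-in for a latent video diffusion backbone. It predicts the
    noise on a stacked latent holding RGB, albedo, roughness/metallic, and
    normal channels for every view, conditioned on per-view camera poses.
    All names and shapes here are illustrative assumptions."""

    def __init__(self, latent_ch: int = 4, groups: int = 4, cond_dim: int = 64):
        super().__init__()
        c = latent_ch * groups                      # all modalities stacked
        self.pose_mlp = nn.Sequential(              # embeds flattened 3x4 extrinsics
            nn.Linear(12, cond_dim), nn.SiLU(), nn.Linear(cond_dim, c),
        )
        self.net = nn.Sequential(                   # per-view conv stack on (B*V, C, H, W)
            nn.Conv2d(c, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, c, 3, padding=1),
        )

    def forward(self, z_t: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # z_t: (B, V, C, H, W) noisy stacked latents; poses: (B, V, 3, 4)
        B, V, C, H, W = z_t.shape
        cond = self.pose_mlp(poses.reshape(B * V, 12))          # (B*V, C)
        x = z_t.reshape(B * V, C, H, W) + cond[:, :, None, None]
        return self.net(x).reshape(B, V, C, H, W)               # predicted noise

model = JointPBRDenoiser()
z = torch.randn(1, 8, 16, 32, 32)   # 8 views, 4 modality groups x 4 latent channels
poses = torch.randn(1, 8, 3, 4)     # per-view camera extrinsics (explicit camera control)
eps_hat = model(z, poses)           # one joint noise prediction across all views
```

A real video backbone would also attend across the view axis, which is what enforces multi-view consistency; the per-view convolution above only keeps the sketch short.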
Problem

Research questions and friction points this paper is trying to address.

Predicting multi-view consistent PBR materials from single images
Generating relightable 3D assets with controlled appearance editing
Extending video diffusion models for joint PBR parameter estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends a latent video diffusion model to jointly output spatially varying PBR parameters and surface normals (relighting from these outputs is sketched after this list)
Uses explicit camera control for multi-view generation
Introduces pipeline mechanisms that improve quality in this ill-posed setting
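
The relighting claim is easiest to see downstream of the model: once per-view albedo, roughness, metallic, and normal maps are predicted, any standard analytic BRDF can re-shade the object under new lighting. Below is a minimal sketch using a common Cook-Torrance GGX parameterization with a single directional light; the function and its conventions are our illustrative assumptions, not the paper's actual shading pipeline.

```python
import numpy as np

def relight(albedo, roughness, metallic, normal, light_dir, view_dir, light_rgb):
    """Relight one view from predicted PBR maps with a directional light.
    albedo/normal: (H, W, 3); roughness/metallic: (H, W); directions point
    from the surface toward the light / camera. Illustrative sketch only."""
    n = normal / np.linalg.norm(normal, axis=-1, keepdims=True)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    h = (l + v) / np.linalg.norm(l + v)                 # half vector
    nl = np.clip(n @ l, 1e-4, 1.0)[..., None]           # (H, W, 1)
    nv = np.clip(n @ v, 1e-4, 1.0)[..., None]
    nh = np.clip(n @ h, 0.0, 1.0)[..., None]
    vh = float(np.clip(v @ h, 0.0, 1.0))
    r, m = roughness[..., None], metallic[..., None]
    a2 = np.maximum(r, 1e-3) ** 4                       # alpha^2 with alpha = r^2
    D = a2 / (np.pi * (nh**2 * (a2 - 1.0) + 1.0) ** 2)  # GGX normal distribution
    k = (r + 1.0) ** 2 / 8.0                            # Schlick-GGX geometry term
    G = (nl / (nl * (1 - k) + k)) * (nv / (nv * (1 - k) + k))
    F0 = 0.04 * (1.0 - m) + albedo * m                  # dielectric vs. metal base
    F = F0 + (1.0 - F0) * (1.0 - vh) ** 5               # Fresnel-Schlick
    spec = D * G * F / (4.0 * nl * nv)
    diff = (1.0 - F) * (1.0 - m) * albedo / np.pi       # energy-conserving diffuse
    return (diff + spec) * np.asarray(light_rgb) * nl   # outgoing radiance

# Tiny usage example on a flat gray patch facing the camera.
H = W = 4
flat_n = np.zeros((H, W, 3)); flat_n[..., 2] = 1.0
img = relight(np.full((H, W, 3), 0.5), np.full((H, W), 0.4), np.zeros((H, W)),
              flat_n, np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]),
              np.array([1.0, 1.0, 1.0]))
```

Because the lighting model is explicit, swapping `light_dir` or `light_rgb` re-renders the same predicted materials under arbitrary illumination, which is what makes the generated assets relightable and editable.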