🤖 AI Summary
Existing inverse-rendering approaches struggle on real-world videos because precise 3D geometry, material, and lighting priors are unavailable, which in turn hinders accurate forward rendering. To address this, we propose the first end-to-end neural rendering framework that leverages video diffusion models to jointly estimate G-buffers (geometry, materials, lighting) and synthesize photorealistic, lighting-consistent novel views from a single input video. Our contributions are threefold: (1) the first unified use of video diffusion models for both inverse inference and forward generation; (2) a G-buffer-conditioned generative architecture with explicit temporal consistency modeling; and (3) a joint inverse-forward weakly supervised training paradigm. Evaluated on multiple benchmarks, our method surpasses state-of-the-art approaches in PSNR and SSIM, enabling real-time relighting, material editing, and object insertion while preserving spatiotemporal coherence.
📝 Abstract
Understanding and modeling lighting effects are fundamental tasks in computer vision and graphics. Classic physically-based rendering (PBR) accurately simulates light transport, but relies on precise scene representations--explicit 3D geometry, high-quality material properties, and lighting conditions--that are often impractical to obtain in real-world scenarios. To address this, we introduce DiffusionRenderer, a neural approach that tackles the dual problem of inverse and forward rendering within a holistic framework. Leveraging powerful video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing both an interface for image editing tasks and training data for the rendering model. Conversely, our rendering model generates photorealistic images from G-buffers without explicit light transport simulation. Experiments demonstrate that DiffusionRenderer effectively approximates inverse and forward rendering, consistently outperforming the state-of-the-art. Our model enables practical applications from a single video input--including relighting, material editing, and realistic object insertion.
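The abstract describes a two-stage dataflow: an inverse model turns video frames into G-buffers, and a forward model turns (possibly edited) G-buffers plus lighting back into images. The sketch below illustrates only that dataflow with toy stand-in functions; the function names, G-buffer keys, and shading rule are illustrative assumptions, not the paper's actual models or API.

```python
import numpy as np

def inverse_render(frames):
    """Stand-in for the learned inverse model: video frames -> per-frame G-buffers.

    Hypothetical placeholder: the real model is a video diffusion network; here we
    just emit arrays with plausible G-buffer shapes.
    """
    return {
        "normals": np.zeros_like(frames),                     # per-pixel surface normals
        "albedo": frames * 0.8,                               # base color (toy estimate)
        "roughness": np.full(frames.shape[:-1] + (1,), 0.5),  # scalar roughness channel
    }

def forward_render(gbuffers, env_light):
    """Stand-in for the learned forward model: G-buffers + lighting -> relit frames.

    A toy shading rule replaces the G-buffer-conditioned diffusion generator.
    """
    return np.clip(gbuffers["albedo"] * env_light, 0.0, 1.0)

# Relighting a video then reduces to: invert, swap the lighting, re-render.
frames = np.random.rand(8, 64, 64, 3)       # T x H x W x C input video in [0, 1]
gbuffers = inverse_render(frames)
relit = forward_render(gbuffers, env_light=1.2)  # brighter environment light
```

Material editing and object insertion follow the same pattern: modify the G-buffers (e.g., change `albedo` or composite a new object's buffers) before calling the forward model.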