🤖 AI Summary
This work addresses key challenges in single-image and video relighting, namely inaccurate estimation of intrinsic properties, poor generalization, and error accumulation in two-stage pipelines, by proposing an end-to-end paradigm that jointly estimates albedo and synthesizes the relit output. Departing from conventional decomposition-then-composition frameworks, our method implicitly models complex light-material interactions (e.g., shadows, specularities, transparency), improving the disentanglement of reflectance and illumination. Leveraging a video diffusion model, we train on synthetically generated multi-illumination data augmented with large-scale, automatically annotated real-world videos. Crucially, temporal consistency is preserved across output frames during inference. Experiments demonstrate significant improvements in relighting quality across diverse scenes, challenging lighting conditions, and heterogeneous materials, and our approach achieves superior visual fidelity and cross-domain generalization compared to current state-of-the-art methods.
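To make the joint formulation concrete, below is a minimal, illustrative PyTorch sketch of a denoiser that predicts the noise on a concatenated albedo-plus-relit-video latent in a single pass, conditioned on the input-video latent and a target lighting embedding. The module name, channel counts, shapes, noise schedule, and conditioning scheme here are assumptions made for illustration only and do not reflect the paper's actual architecture or training recipe.

```python
import torch
import torch.nn as nn

class JointRelightingDenoiser(nn.Module):
    """Toy stand-in for a video diffusion backbone that jointly denoises
    an albedo latent and a relit-video latent in one forward pass."""
    def __init__(self, latent_ch=4, cond_ch=4, light_dim=32, hidden=64):
        super().__init__()
        # Input: noisy [albedo | relit] latents + input-video latent + lighting embedding.
        in_ch = 2 * latent_ch + cond_ch + light_dim
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv3d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv3d(hidden, 2 * latent_ch, 3, padding=1),  # noise for both outputs
        )

    def forward(self, noisy_joint, video_latent, light_embed):
        # Broadcast the lighting embedding over time and space, then concatenate.
        b, _, t, h, w = noisy_joint.shape
        light = light_embed.view(b, -1, 1, 1, 1).expand(b, light_embed.shape[1], t, h, w)
        x = torch.cat([noisy_joint, video_latent, light], dim=1)
        return self.net(x)

# One illustrative training step on paired data (all shapes are placeholders).
model = JointRelightingDenoiser()
video_latent = torch.randn(2, 4, 8, 32, 32)                       # encoded input video
target_joint = torch.cat([torch.randn(2, 4, 8, 32, 32),           # albedo latent
                          torch.randn(2, 4, 8, 32, 32)], dim=1)   # relit-video latent
light_embed = torch.randn(2, 32)                                   # target illumination embedding
noise = torch.randn_like(target_joint)
alpha = 0.7                                                        # stand-in for a timestep schedule
noisy_joint = alpha ** 0.5 * target_joint + (1 - alpha) ** 0.5 * noise
loss = nn.functional.mse_loss(model(noisy_joint, video_latent, light_embed), noise)
loss.backward()
```

Predicting the albedo and relit latents from one shared backbone is what lets supervision on either quantity shape the same internal scene representation, which is the intuition behind the joint formulation described above.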
📝 Abstract
We address the challenge of relighting a single image or video, a task that demands a precise understanding of scene intrinsics and high-quality synthesis of light transport. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.
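The abstract mentions training on a mix of synthetic multi-illumination data and automatically labeled real-world videos. The sketch below shows one generic way such a mixed corpus could be fed to a trainer, oversampling the smaller synthetic set so every batch contains both domains. The dataset classes, label contents, sampling weights, and the way real clips supervise relighting are all hypothetical; the paper's actual data pipeline is not specified here.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset, WeightedRandomSampler

class SyntheticMultiIllumination(Dataset):
    """Placeholder: renders of the same scene under several known lightings."""
    def __len__(self): return 1000
    def __getitem__(self, i):
        clip = torch.randn(8, 3, 64, 64)      # input video clip
        albedo = torch.randn(8, 3, 64, 64)    # ground-truth albedo
        relit = torch.randn(8, 3, 64, 64)     # same clip under the target lighting
        light = torch.randn(32)               # target illumination embedding
        return clip, albedo, relit, light

class AutoLabeledRealVideos(Dataset):
    """Placeholder: real clips with pseudo-labels from an automatic annotator."""
    def __len__(self): return 9000
    def __getitem__(self, i):
        clip = torch.randn(8, 3, 64, 64)
        albedo = torch.randn(8, 3, 64, 64)    # pseudo-label from the annotator
        relit = clip.clone()                  # placeholder target; the real supervision scheme is unspecified
        light = torch.randn(32)               # estimated illumination of the clip
        return clip, albedo, relit, light

syn, real = SyntheticMultiIllumination(), AutoLabeledRealVideos()
dataset = ConcatDataset([syn, real])
# Weight samples inversely to dataset size so batches mix both domains evenly.
weights = [1.0 / len(syn)] * len(syn) + [1.0 / len(real)] * len(real)
sampler = WeightedRandomSampler(weights, num_samples=len(dataset))
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

clip, albedo, relit, light = next(iter(loader))  # one mixed-domain batch
```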