🤖 AI Summary
Existing human relighting methods generalize poorly: they are restricted to specific scenarios such as faces or static human bodies, and they lack cross-scene and temporal consistency. Method: We propose the first monocular, general-purpose human relighting framework. It uses a pre-trained diffusion model as a general image prior; jointly models human relighting and background harmonization in a coarse-to-fine framework; adds an unsupervised temporal lighting model that learns lighting cycle consistency from real-world videos without ground-truth supervision; blends spatio-temporal features at inference time without extra training; and applies guided refinement to preserve high-frequency details from the input. Contribution/Results: Our method is the first to both control and harmonize lighting for humans in arbitrary poses, with arbitrary body parts, in arbitrary scenes. It improves cross-scene generalization and video temporal coherence over existing image-based human relighting and harmonization methods, easing the dependence on scarce relighting datasets while preserving input details.
📝 Abstract
This paper introduces Comprehensive Relighting, the first all-in-one approach that can both control and harmonize the lighting of an image or video of humans with arbitrary body parts in any scene. Building such a generalizable model is extremely challenging because of the lack of datasets, which restricts existing image-based relighting models to a specific scenario (e.g., face or static human). To address this challenge, we repurpose a pre-trained diffusion model as a general image prior and jointly model human relighting and background harmonization in a coarse-to-fine framework. To further enhance the temporal coherence of the relighting, we introduce an unsupervised temporal lighting model that learns lighting cycle consistency from many real-world videos without any ground truth. At inference time, our temporal lighting module is combined with the diffusion models through a spatio-temporal feature blending algorithm, without extra training. We also apply a new guided refinement as a post-processing step to preserve the high-frequency details of the input image. In the experiments, Comprehensive Relighting shows strong generalizability and temporal lighting coherence, outperforming existing image-based human relighting and harmonization methods.
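The abstract does not specify how the guided refinement works, so the following is only an illustrative sketch of the general idea of keeping high-frequency detail from the input while taking lighting from a relit result. It substitutes a plain frequency split with a box blur for the paper's method; the function names `box_blur` and `transfer_details` and the `radius` parameter are hypothetical.

```python
import numpy as np


def box_blur(img, radius):
    """Separable box blur with edge padding; img is a float array, HxW or HxWxC."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    out = np.asarray(img, dtype=np.float64)
    for axis in (0, 1):  # blur rows, then columns
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out


def transfer_details(input_img, relit_img, radius=2):
    """Combine low frequencies of the relit image with high frequencies of the input.

    Assumption: both images are aligned, same shape, values in [0, 1].
    This is a stand-in technique, not the paper's guided refinement.
    """
    high = input_img - box_blur(input_img, radius)  # fine detail of the input
    low = box_blur(relit_img, radius)               # smooth lighting of the relit result
    return np.clip(low + high, 0.0, 1.0)
```

With this split, passing the same image as both arguments returns it unchanged, and a larger `radius` moves more of the input's structure into the preserved detail layer.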