AI Summary
This work addresses the challenges of temporal consistency and photorealistic appearance in full-body human video relighting by proposing a subject-specific relighting framework based on video diffusion models. The method fine-tunes a pretrained text-to-video diffusion model and introduces a tokenized light source representation coupled with a masked attention mechanism to control dynamic illumination sequences. To build a diverse hybrid training dataset, it combines static OLAT (One-Light-at-a-Time) capture with a novel dynamic performance capture that rapidly interleaves two smoothly varying lighting sequences, fast enough that the interleaving does not appear to strobe. According to the authors, the approach produces temporally coherent, realistic, and robust relighting results across body poses, camera viewpoints, and lighting conditions.
Abstract
Relighting human performances is a fundamental task in post-production and content creation. We present BodyReLux, a subject-specific video diffusion-based framework for relighting full-body human performances in a temporally consistent way. Our model is trained on a hybrid dataset of pixel-aligned video relighting pairs, covering a diverse combination of lighting conditions, performances, and viewpoints. To acquire such a dataset, we combine traditional static One-Light-at-a-Time (OLAT) capture with a novel dynamic performance capture in which two smoothly varying lighting sequences are rapidly interleaved. Because the lighting operates above the human flicker-fusion threshold, the interleaving does not appear to strobe. We train our video relighting model from a pretrained text-to-video model to fully leverage its generative priors for producing high-quality videos. To achieve accurate lighting control, we introduce a new lighting conditioning method that represents each light source as a token. We further condition on sequences of lighting using masked attention to support dynamic lighting control. Together with a carefully designed data augmentation pipeline, we achieve photorealistic, robust, and temporally consistent video relighting of subject-specific human performances.
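The abstract does not spell out how the light tokens and the attention mask are laid out, so the following is only a minimal NumPy sketch of one plausible reading: each frame has its own set of light tokens, and a mask restricts every video token to attend only to the light tokens of its own frame. The function names (`frame_light_mask`, `masked_cross_attention`), the per-frame masking rule, and all shapes are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def frame_light_mask(num_frames, tokens_per_frame, lights_per_frame):
    """Boolean mask of shape (F*T, F*L).

    Entry (i, j) is True when video token i and light token j belong to the
    same frame. ASSUMPTION: per-frame masking is our guess at how a lighting
    sequence could be tied to frames; the abstract does not specify it.
    """
    query_frame = np.repeat(np.arange(num_frames), tokens_per_frame)
    key_frame = np.repeat(np.arange(num_frames), lights_per_frame)
    return query_frame[:, None] == key_frame[None, :]


def masked_cross_attention(queries, keys, values, mask):
    """Scaled dot-product cross-attention with a boolean attention mask.

    queries: (Nq, D) video tokens; keys: (Nk, D) and values: (Nk, Dv) are
    derived from light tokens. Every query row must have at least one
    allowed key, otherwise the softmax is undefined.
    """
    scale = 1.0 / np.sqrt(queries.shape[-1])
    scores = (queries @ keys.T) * scale
    scores = np.where(mask, scores, -np.inf)            # block disallowed pairs
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                             # exp(-inf) == 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F, T, L, D = 2, 3, 2, 8  # frames, video tokens/frame, lights/frame, width
    video_tokens = rng.normal(size=(F * T, D))
    light_tokens = rng.normal(size=(F * L, D))  # hypothetical light embeddings
    mask = frame_light_mask(F, T, L)
    out, attn = masked_cross_attention(video_tokens, light_tokens, light_tokens, mask)
    print(out.shape, attn.round(2))
```

With this layout, changing the light tokens assigned to one frame affects only that frame's video tokens through the cross-attention path, which is one way a per-frame lighting sequence could drive dynamic relighting.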