LuxDiT: Lighting Estimation with Video Diffusion Transformer

📅 2025-09-03
🤖 AI Summary
Single-image and video illumination estimation has long been hindered by the scarcity of real-world HDR environment maps, the difficulty of modeling global contextual dependencies, and the challenge of generating high-dynamic-range outputs. To address these, the authors propose LuxDiT, an illumination estimation framework built on a video diffusion transformer that leverages temporal priors from video sequences to enhance global illumination consistency. A LoRA-based fine-tuning strategy improves semantic alignment between the input and the predicted environment map and recovers high-frequency angular detail. The model is trained on large-scale synthetic illumination data together with real HDR panoramic images, enabling it to learn lighting from indirect visual cues. Extensive experiments show that the method significantly outperforms state-of-the-art approaches on both synthetic and real-world benchmarks: the generated HDR environment maps exhibit accurate spatial illumination distributions and rich high-frequency detail, substantially improving realism and generalization.
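
As a rough sketch of the conditional generation described above (all names here, such as `video_dit` and `encode_frames`, plus the latent shape and noise schedule, are hypothetical and not taken from the paper), inference might look like a standard DDPM-style denoising loop over an environment-map latent conditioned on features of the input frames:

```python
import torch

@torch.no_grad()
def estimate_lighting(video_dit, encode_frames, frames, steps=50):
    """Hypothetical sketch: denoise an environment-map latent
    conditioned on features of the input video frames."""
    betas = torch.linspace(1e-4, 0.02, steps)      # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    cond = encode_frames(frames)                   # visual conditioning tokens
    x = torch.randn(1, 4, 32, 64)                  # panorama latent (shape assumed)
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, dtype=torch.long)
        eps = video_dit(x, t_batch, cond)          # transformer predicts the noise
        # standard DDPM posterior-mean update
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                       # decoded to an HDR panorama downstream
```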

📝 Abstract
Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.
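
The abstract singles out the recovery of high-dynamic-range outputs as one of the core difficulties. A common workaround in generative HDR work (not confirmed as LuxDiT's specific approach) is to train the generator on log-encoded radiance so values stay in a bounded range, inverting the encoding afterward; a minimal sketch:

```python
import torch

def hdr_to_model_space(radiance: torch.Tensor, max_log: float = 10.0) -> torch.Tensor:
    """Compress linear HDR radiance into [0, 1] via log encoding (assumed practice)."""
    return torch.log1p(radiance).clamp(max=max_log) / max_log

def model_space_to_hdr(encoded: torch.Tensor, max_log: float = 10.0) -> torch.Tensor:
    """Invert the log encoding back to linear HDR radiance."""
    return torch.expm1(encoded * max_log)
```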
Problem

Research questions and friction points this paper is trying to address.

Estimating scene lighting from a single image or video input
Overcoming scarcity of ground-truth HDR environment maps
Inferring global illumination from indirect visual cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video diffusion transformer for HDR lighting
Low-rank adaptation (LoRA) finetuning strategy (see the sketch after this list)
Synthetic dataset training for generalization
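
The low-rank adaptation stage can be illustrated with a standard LoRA wrapper around a frozen linear layer; the rank, scaling, and which projections it is applied to are assumptions here, not details from the paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adapter over a frozen linear layer: y = Wx + scale * B(A(x)).
    Only the small A and B matrices are trained; base weights stay frozen."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                # freeze the pretrained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)             # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

Wrapping the transformer's attention projections with such adapters and training only them on the collected HDR panoramas would match the paper's description of a low-rank finetuning stage, though the exact placement and hyperparameters are not stated.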