Light-X: Generative 4D Video Rendering with Camera and Illumination Control

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing monocular video relighting methods struggle to achieve illumination fidelity and temporal coherence simultaneously, and they lack joint controllability over camera trajectories and lighting. This paper proposes Light-X, a controllable 4D video generation framework that decouples geometry, motion, and illumination signals to enable coordinated manipulation of camera pose and lighting. Methodologically: (1) we introduce a dynamic point cloud representation with differentiable relighting projection to cleanly separate geometric structure from illumination; (2) we design Light-Syn, a degradation-based data-synthesis pipeline, to generate multi-view, multi-illumination training data; (3) we integrate text- and background-conditioned diffusion models to support high-fidelity relighting and novel-view synthesis. Experiments demonstrate that the method significantly improves visual quality and temporal consistency under joint control, outperforming state-of-the-art approaches across diverse text prompts and background conditions.
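
The summary does not spell out the rendering step, but the conditioning signal it describes (a dynamic point cloud projected along a user-defined camera trajectory) can be illustrated with a minimal unproject-then-splat sketch. Everything below is an assumption for illustration: `unproject_depth`, `project_points`, and the naive z-buffer stand in for the paper's differentiable relighting projection, whose details are not given here.

```python
import numpy as np

def unproject_depth(depth, K):
    """Lift a depth map into a 3D point cloud in camera coordinates.
    depth: (H, W) metric depth; K: (3, 3) camera intrinsics."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T        # per-pixel rays at z = 1
    return rays * depth.reshape(-1, 1)     # (H*W, 3) points

def project_points(points, colors, K, w2c, H, W):
    """Splat colored 3D points into a target view given a 4x4
    world-to-camera pose w2c. Returns the rendered frame and a
    mask marking pixels that received a point (holes = disocclusion)."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (w2c @ pts_h.T).T[:, :3]
    front = cam[:, 2] > 1e-6               # keep points in front of camera
    cam, colors = cam[front], colors[front]
    pix = (K @ cam.T).T
    uv = np.round(pix[:, :2] / pix[:, 2:3]).astype(int)
    frame = np.zeros((H, W, 3))
    mask = np.zeros((H, W), dtype=bool)
    order = np.argsort(-cam[:, 2])         # far-to-near: near points win
    u, v = uv[order, 0], uv[order, 1]
    keep = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    frame[v[keep], u[keep]] = colors[order][keep]
    mask[v[keep], u[keep]] = True
    return frame, mask
```

Repeating the projection per frame, with the same point cloud for the geometry render and for the relit reference frame, is what keeps the two cues aligned in the same target view.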

📝 Abstract
Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.
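
The abstract describes Light-Syn only as "degradation-based with inverse-mapping," so the following is a schematic of that idea rather than the paper's operators: apply a parametric, invertible lighting degradation to an in-the-wild frame and keep the parameters so the original remains recoverable as the supervision target. The function names and the specific gain/gamma degradations are illustrative assumptions.

```python
import numpy as np

def degrade_illumination(frame, rng):
    """Apply a random, invertible lighting degradation to a frame in [0, 1].
    Returns the degraded frame plus the parameters needed to undo it."""
    gain = rng.uniform(0.4, 1.0, size=3)   # hypothetical per-channel color cast
    gamma = rng.uniform(0.8, 1.6)          # hypothetical exposure curve
    return np.clip(frame ** gamma * gain, 0.0, 1.0), (gain, gamma)

def make_training_pair(frame, rng):
    """Build a (degraded input, original target) relighting pair; storing
    the parameters keeps the degraded -> original mapping recoverable."""
    degraded, params = degrade_illumination(frame, rng)
    return {"input": degraded, "target": frame, "params": params}

rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))            # stand-in for a monocular video frame
pair = make_training_pair(frame, rng)
```
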
Problem

Research questions and friction points this paper is trying to address.

Existing methods cannot jointly control camera trajectory and illumination in video generation
Geometry and lighting signals are entangled, which degrades lighting fidelity and temporal consistency
Paired multi-view, multi-illumination video data is scarce and must be synthesized from monocular footage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangles geometry and lighting via dynamic point clouds and a relighting projection (see the conditioning sketch after this list)
Synthesizes multi-view, multi-illumination training pairs from monocular videos via the Light-Syn degradation pipeline
Enables joint camera and illumination control for 4D video generation
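
Both the summary and the abstract describe feeding two disentangled cues to a text- or background-conditioned diffusion model: a geometry/motion render and a consistently projected relit frame. One plausible, purely illustrative interface is to stack them, together with a disocclusion mask, into a single conditioning tensor; the shapes and channel layout here are assumptions, not the paper's specification.

```python
import numpy as np

def build_condition(geometry_frame, relit_frame, mask):
    """Stack disentangled cues for a conditioned diffusion model:
    geometry/motion from the point-cloud render, illumination from the
    relit frame, and a hole mask for disoccluded pixels."""
    return np.concatenate(
        [geometry_frame, relit_frame, mask[..., None].astype(np.float32)],
        axis=-1,                           # (H, W, 3 + 3 + 1) condition
    )
```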