🤖 AI Summary
Existing monocular video relighting methods struggle to achieve illumination fidelity and temporal coherence simultaneously, and they lack joint control over camera trajectories and lighting. This paper proposes a controllable 4D video generation framework that decouples geometry, motion, and illumination signals to enable coordinated manipulation of camera pose and lighting. Methodologically, the authors (1) introduce a dynamic point cloud representation with a differentiable relighting projection to precisely separate geometric structure from illumination; (2) design Light-Syn, a degradation-aware synthesis pipeline, to generate multi-view, multi-illumination training data; and (3) integrate text- and background-conditioned diffusion models to support high-fidelity relighting and novel-view synthesis. Experiments demonstrate that the method significantly improves visual quality and temporal consistency under joint control, outperforming state-of-the-art approaches across diverse text prompts and background conditions.
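The disentangled conditioning can be pictured as two renderings of the same dynamic point cloud through the same user-defined camera: one carrying the source frame's colors (geometry/motion cue) and one carrying relit colors (illumination cue). The following NumPy sketch illustrates this idea only; it is not the paper's implementation, and the intrinsics, pose, point cloud, and "relit" colors are illustrative assumptions.

```python
import numpy as np

def project_points(points_xyz, colors, K, w2c, hw=(256, 256)):
    """Z-buffered pinhole projection of a colored point cloud into one view."""
    h, w = hw
    # Transform world-space points into the camera frame.
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    cam = (w2c @ pts_h.T).T[:, :3]
    in_front = cam[:, 2] > 1e-6
    cam, colors = cam[in_front], colors[in_front]
    # Perspective projection to pixel coordinates.
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, c = u[valid], v[valid], cam[valid, 2], colors[valid]
    # Simple z-buffer: draw far-to-near so the nearest point wins per pixel.
    order = np.argsort(-z)
    image = np.zeros((h, w, 3), dtype=np.float32)
    image[v[order], u[order]] = c[order]
    return image

# Toy dynamic point cloud at one frame (positions + per-point RGB lifted from the video).
rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(5000, 3)) + np.array([0.0, 0.0, 4.0])
orig_rgb = rng.uniform(0, 1, size=(5000, 3))                      # colors from the source frame
relit_rgb = np.clip(orig_rgb * np.array([1.3, 1.0, 0.7]), 0, 1)   # stand-in "relit" colors

K = np.array([[200.0, 0, 128], [0, 200.0, 128], [0, 0, 1]])       # assumed pinhole intrinsics
w2c = np.eye(4)                                                    # one pose on the user trajectory

geom_cue = project_points(points, orig_rgb, K, w2c)    # geometry/motion conditioning stream
light_cue = project_points(points, relit_rgb, K, w2c)  # illumination conditioning stream
```

Because both cue images share the same points and camera pose, geometry is held fixed while only appearance differs, which is the disentanglement the summary describes.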
📝 Abstract
Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.
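To make the degradation-with-inverse-mapping idea concrete, here is a minimal sketch of how such training pairs could be formed: an in-the-wild frame is perturbed by an invertible illumination change, the perturbed frame becomes the source to be relit, and the original frame serves as the target. The specific degradations (gamma and per-channel gains) are assumptions for illustration; the actual Light-Syn pipeline is described in the paper and also covers multi-view projection and static, dynamic, and AI-generated scenes.

```python
import numpy as np

def degrade_illumination(frame, rng):
    """Apply a random, invertible illumination change: gamma + per-channel gain.

    Returns the degraded frame and the parameters that describe the inverse
    mapping back to the original lighting. (Illustrative degradations only.)
    """
    gamma = rng.uniform(0.6, 1.6)
    gains = rng.uniform(0.7, 1.3, size=3)
    degraded = np.clip((frame ** gamma) * gains, 0.0, 1.0)
    return degraded, {"gamma": gamma, "gains": gains}

def make_training_pair(frame, rng):
    """Build a (source, target) pair: the degraded frame is the input to be
    relit, and the original in-the-wild frame is the relighting target."""
    degraded, params = degrade_illumination(frame, rng)
    return {"source": degraded, "target": frame, "light_params": params}

rng = np.random.default_rng(0)
frame = rng.uniform(0.0, 1.0, size=(64, 64, 3))  # stand-in for a video frame in [0, 1]
pair = make_training_pair(frame, rng)
```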