🤖 AI Summary
Existing video generation models struggle to jointly model intrinsic scene properties (albedo, surface normals, material, and irradiance) and lack a closed-loop framework that supports physical interpretability and editable control. This work introduces V-RGBX, the first end-to-end framework for intrinsic-aware video editing. It unifies three capabilities: inverse rendering of video into intrinsic channels, photorealistic video synthesis from those channels, and keyframe-conditioned editing. At its core is an interleaved, physically grounded conditioning mechanism that lets users manipulate any intrinsic modality through selected keyframes and propagates those edits across the sequence. Qualitative and quantitative experiments show temporally consistent, photorealistic results that surpass prior methods on object appearance editing and scene relighting.
📝 Abstract
Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.
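To make the idea of editing through intrinsic channels concrete, here is a toy sketch. It is not the V-RGBX model, which the abstract describes only at a high level. It assumes the classical Lambertian intrinsic image relation (RGB is roughly albedo times irradiance), a static camera and scene, and a user-supplied mask. The function names `recompose` and `propagate_albedo_edit` are hypothetical and chosen for illustration only.

```python
import numpy as np

def recompose(albedo, irradiance):
    """Toy forward model (assumption): rgb = albedo * irradiance, clipped to [0, 1]."""
    return np.clip(albedo * irradiance, 0.0, 1.0)

def propagate_albedo_edit(albedo_video, edited_key_albedo, mask):
    """Copy the edited keyframe albedo into the masked pixels of every frame.

    albedo_video:      (T, H, W, 3) per-frame albedo
    edited_key_albedo: (H, W, 3) albedo of the edited keyframe
    mask:              (H, W) boolean region that was edited
    Assumes a static scene, so the mask is valid for all frames.
    """
    out = albedo_video.copy()                 # leave the input untouched
    out[:, mask] = edited_key_albedo[mask]    # broadcast the edit over time
    return out

# Tiny example: 3 frames of a 2x2 gray scene under brightening light.
T, H, W = 3, 2, 2
albedo = np.full((T, H, W, 3), 0.5)
irradiance = np.linspace(0.2, 1.0, T).reshape(T, 1, 1, 1) * np.ones((T, H, W, 3))

# Edit one pixel of the keyframe albedo to pure red.
mask = np.zeros((H, W), dtype=bool)
mask[0, 0] = True
key_albedo = albedo[0].copy()
key_albedo[0, 0] = [1.0, 0.0, 0.0]

edited_albedo = propagate_albedo_edit(albedo, key_albedo, mask)
edited_rgb = recompose(edited_albedo, irradiance)
print(edited_rgb[:, 0, 0])  # red pixel whose brightness follows the lighting
```

Because the edit lives in the albedo channel, the recolored pixel still inherits each frame's lighting when the video is recomposed. A learned model such as the one described above would replace both the naive copy and the multiplicative recomposition.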