🤖 AI Summary
In professional video compositing, environment interactions between foreground and background (such as shadows, reflections, dust, and splashes) have traditionally relied on labor-intensive manual creation. Existing video generation models struggle to inject photorealistic interactions while preserving the input video; conversely, video inpainting methods either require frame-wise manual masks or produce geometrically distorted outputs. To address this, the paper introduces “augmented compositing” as a novel task and presents Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. The authors construct a paired effect dataset tailored to this task and propose an unpaired augmentation strategy that preserves text-driven editability, with optional mask control and keyframe guidance requiring no dense annotations. Experiments demonstrate that, despite limited training data, Over++ generates diverse, realistic semi-transparent environmental effects, achieving state-of-the-art performance in both effect generation and source-scene preservation.
📝 Abstract
In professional video compositing workflows, artists must manually create environmental interactions (such as shadows, reflections, dust, and splashes) between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.