🤖 AI Summary
In professional video compositing, environment interactions between foreground and background (such as shadows, reflections, dust, and splashes) have traditionally relied on labor-intensive manual creation. Existing video generation models struggle to inject photorealistic interactions while preserving the input video; conversely, video inpainting methods either require frame-wise manual masks or produce geometrically distorted outputs. To address this, the paper introduces “augmented compositing” as a novel task and presents Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. The authors construct a paired effect dataset tailored to this task and propose an unpaired augmentation strategy that preserves text-driven editability, with optional mask control and keyframe guidance requiring no dense annotations. Experiments demonstrate that, despite limited training data, Over++ generates diverse, realistic semi-transparent environmental effects, achieving state-of-the-art performance in both effect generation and source-scene preservation.
📝 Abstract
In professional video compositing workflows, artists must manually create environmental interactions (such as shadows, reflections, dust, and splashes) between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.