InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

📅 2025-12-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenging problem of geometrically consistent, photorealistic object insertion into real-world videos, a task hampered by insufficient 4D scene understanding, difficulty modeling dynamic occlusions, and inconsistent lighting. To this end, we propose the first joint architecture integrating a 4D-aware mask generation module with a lighting-aware diffusion model for video object insertion. We introduce ROSE++, a large-scale synthetic dataset enabling end-to-end supervised training. Our framework unifies 4D scene reconstruction, temporal mask propagation, diffusion-model fine-tuning, vision-language model (VLM)-assisted annotation, physics-inspired illumination modeling, and local rendering. Extensive evaluations across diverse real-world scenes demonstrate substantial improvements over state-of-the-art methods, achieving high-fidelity 4D video synthesis with precise object placement, physically plausible occlusion handling, photometrically coherent lighting, and temporally seamless motion.

📝 Abstract
Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object-removed video, object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.
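The occlusion-consistency idea behind the 4D-aware mask generation can be illustrated with a minimal per-frame depth test: once scene depth is reconstructed, the inserted object should remain visible only at pixels where it lies in front of the existing geometry. The sketch below is a hypothetical illustration of that principle, not the paper's actual module; the function name and inputs (per-frame depth maps and an object silhouette) are assumptions.

```python
import numpy as np

def occlusion_aware_mask(obj_depth, scene_depth, obj_alpha):
    """Visibility mask for an inserted object in one frame.

    obj_depth:   (H, W) depth of the rendered object (np.inf where absent)
    scene_depth: (H, W) reconstructed scene depth for the same frame
    obj_alpha:   (H, W) object silhouette in [0, 1]
    Returns an (H, W) float mask: the silhouette clipped by occluders.
    """
    visible = obj_depth < scene_depth  # object in front of the scene surface
    return obj_alpha * visible.astype(np.float32)
```

Applying this test frame by frame, with depths taken from the same 4D reconstruction, is what keeps the propagated mask consistent as occluders move through the shot.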
Problem

Research questions and friction points this paper is trying to address.

Realistic video object insertion with 4D scene geometry
Handling occlusion and lighting effects in video synthesis
Achieving geometrically consistent and appearance-faithful object placement
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D-aware mask generation for geometry reconstruction
Diffusion model extension for joint object and lighting synthesis
Illumination-aware synthetic dataset ROSE++ for supervised training
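As a rough sketch of the ROSE++ supervision signal: each training sample pairs an object-removed clip (the insertion input) with its object-present counterpart (the target) and a VLM-generated reference image of the object. The structure below is illustrative only; the field names and file naming are assumptions, not the released dataset schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoseTriplet:
    """One supervised VOI training example (illustrative; names assumed)."""
    removed_video: str    # background-only clip: conditioning input
    present_video: str    # ground-truth clip containing the object: target
    reference_image: str  # VLM-generated reference image of the object

# Hypothetical sample, showing how the three modalities line up per clip.
sample = RoseTriplet("clip_0001_removed.mp4",
                     "clip_0001_present.mp4",
                     "clip_0001_ref.png")
```

Deriving triplets from an object *removal* dataset is the key trick: the removed/present pair gives pixel-aligned before/after supervision that would be very hard to capture in the wild.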