OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in mask-free video insertion (data scarcity, subject-scene equilibrium, and insertion harmonization), this paper introduces OmniInsert, a unified framework. Methodologically, it leverages a diffusion Transformer architecture to enable end-to-end, high-fidelity insertion from single or multiple reference images. Its core contributions include: (1) InsertPipe, a scalable data pipeline that alleviates annotation bottlenecks; (2) a Condition-Specific Feature Injection mechanism and a Context-Aware Rephraser module that enhance spatiotemporal coherence; and (3) a Progressive Training strategy with a Subject-Focused Loss and Insertive Preference Optimization to improve detail fidelity and natural integration. Evaluated on the newly proposed InsertBench benchmark, OmniInsert significantly outperforms existing closed-source methods, achieving state-of-the-art subject detail preservation and scene-fusion quality and advancing the practicality of controllable video editing.

📝 Abstract
Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals yet still struggle with subject consistency, limiting their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address the data scarcity, we propose a new data pipeline, InsertPipe, which constructs diverse cross-pair data automatically. Building upon our data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions, and we propose a novel Progressive Training strategy that enables the model to balance feature injection from the subjects and the source video. Meanwhile, we design a Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology that optimizes the model by simulating human preferences, and we incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates that OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.
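The abstract does not specify the form of the Subject-Focused Loss. As a generic illustration of the underlying idea (upweighting the denoising objective inside the subject region), here is a minimal sketch; the function name, the linear weighting scheme, and the `subject_weight` parameter are assumptions for illustration, not the authors' actual formulation:

```python
import torch

def subject_focused_loss(pred_noise: torch.Tensor,
                         true_noise: torch.Tensor,
                         subject_mask: torch.Tensor,
                         subject_weight: float = 2.0) -> torch.Tensor:
    """Region-weighted diffusion denoising loss (illustrative sketch).

    Computes per-element squared error between predicted and target noise,
    scaling the error inside the subject region (mask == 1) by
    `subject_weight` so the model attends more to subject appearance.
    `subject_mask` broadcasts over the channel dimension.
    """
    weight = 1.0 + (subject_weight - 1.0) * subject_mask
    return (weight * (pred_noise - true_noise) ** 2).mean()

# Usage: with a uniform (all-zero) mask this reduces to plain MSE;
# a nonzero mask increases the penalty on errors inside the subject region.
pred = torch.zeros(1, 3, 4, 4)
true = torch.ones(1, 3, 4, 4)
mask = torch.zeros(1, 1, 4, 4)
mask[..., :2, :] = 1.0  # hypothetical subject occupies the top half
plain = subject_focused_loss(pred, true, torch.zeros(1, 1, 4, 4))
focused = subject_focused_loss(pred, true, mask)
```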
Problem

Research questions and friction points this paper is trying to address.

Addressing mask-free video insertion challenges without complex control signals
Resolving subject-scene equilibrium and insertion harmonization in video editing
Overcoming data scarcity through automated cross-pair data construction pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

InsertPipe data pipeline for cross-pair data
Condition-Specific Feature Injection for multi-source conditions
Insertive Preference Optimization simulating human preferences
Authors
- Jinshu Chen (Intelligent Creation Lab, ByteDance)
- Xinghui Li (Intelligent Creation Lab, ByteDance)
- Xu Bai (Intelligent Creation Lab, ByteDance)
- Tianxiang Ma (ByteDance Inc.; NLPR, CASIA)
- Pengze Zhang (Intelligent Creation Lab, ByteDance)
- Zhuowei Chen (ByteDance)
- Gen Li (Intelligent Creation Lab, ByteDance)
- Lijie Liu (ByteDance Inc.)
- Songtao Zhao (Intelligent Creation Lab, ByteDance)
- Bingchuan Li (ByteDance)
- Qian He (ByteDance)