Layer-Aware Video Composition via Split-then-Merge

📅 2025-11-25
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address coarse-grained control and the scarcity of high-quality labeled data in generative video synthesis, this paper proposes the Split-then-Merge framework. It first disentangles unlabeled videos into dynamic foreground and static background layers, then reconstructs them via self-supervised composition to establish a controllable generation pathway. The method introduces four key innovations: (1) a hierarchical self-composition mechanism, (2) a transformation-aware training procedure, (3) a multi-level fusion enhancement strategy, and (4) an identity-preserving loss, which together strengthen explicit modeling of motion semantics and foreground detail fidelity. Extensive evaluations on multiple quantitative benchmarks, together with human assessments and VLLM-based qualitative analysis, show that the approach consistently surpasses state-of-the-art methods, with significant improvements in visual realism, temporal coherence, and interaction plausibility.

📝 Abstract
We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data-scarcity problem. Unlike conventional methods that rely on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that uses multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show that StM outperforms state-of-the-art methods on quantitative benchmarks as well as in human- and VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
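
To make the self-composition idea concrete, below is a minimal sketch of one training step in the split-then-merge spirit. Everything here is an illustration under stated assumptions, not the paper's implementation: the soft masks would come from an off-the-shelf matting or segmentation model (here they are random toy tensors), `random_translate` stands in for the transformation-aware augmentation, and the plain reconstruction loss is a placeholder for the actual generative objective.

```python
# Illustrative sketch only: masks are toy tensors standing in for an
# off-the-shelf matting model, and MSE reconstruction stands in for the
# paper's actual generative objective.
import torch
import torch.nn.functional as F

def split(frames, masks):
    """Split frames (T, C, H, W) into foreground/background layers
    using soft masks (T, 1, H, W) in [0, 1]."""
    return frames * masks, frames * (1.0 - masks)

def merge(fg, bg, masks):
    """Alpha-composite the (possibly transformed) foreground layer
    back onto the background layer."""
    return fg * masks + bg * (1.0 - masks)

def random_translate(x, max_shift=0.1):
    """Shift all frames by one random offset, so a model must learn
    placement rather than copy pixels (a stand-in for the paper's
    transformation-aware augmentation)."""
    tx, ty = ((torch.rand(2) * 2.0 - 1.0) * max_shift).tolist()
    theta = torch.tensor([[1.0, 0.0, tx], [0.0, 1.0, ty]])
    theta = theta.unsqueeze(0).expand(x.size(0), -1, -1)
    grid = F.affine_grid(theta, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

# One self-supervised step: split a clip into layers, perturb the
# foreground together with its mask, re-merge, and supervise against
# the original clip.
frames = torch.rand(8, 3, 64, 64)                 # toy clip, T=8
masks = (torch.rand(8, 1, 64, 64) > 0.5).float()  # toy soft masks
fg, bg = split(frames, masks)
moved = random_translate(torch.cat([fg, masks], dim=1))  # same shift for both
fg_aug, mask_aug = moved[:, :3], moved[:, 3:]
composite = merge(fg_aug, bg, mask_aug)
loss = F.mse_loss(composite, frames)              # placeholder objective
```

The key design point the sketch preserves is that the foreground and its mask are transformed together, so the supervision signal asks the model to re-place and re-blend a moving subject into a scene rather than memorize pixels.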
Problem

Research questions and friction points this paper is trying to address.

Enhancing control in generative video composition without annotated datasets
Addressing data scarcity through self-composition of unlabeled video layers
Learning realistic dynamic interactions between foreground and background elements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Splits videos into foreground and background layers
Self-composes layers to learn compositional dynamics
Uses transformation-aware training with an identity-preservation loss (a rough sketch of such a loss follows below)
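
The exact form of the identity-preservation loss is not spelled out in the summary above. As a rough illustration, such a term could penalize the generated composite wherever it drifts from the source foreground inside the (transformed) mask, optionally adding a feature-space term from any frozen encoder. The function below is a hedged sketch under those assumptions, not the paper's formulation.

```python
# Hedged sketch of an identity-preservation loss: masked pixel L1 plus
# an optional feature-space cosine term. The weighting and the choice
# of encoder are assumptions, not the paper's exact design.
import torch
import torch.nn.functional as F

def identity_preservation_loss(generated, fg_ref, mask, encoder=None, w_feat=0.1):
    """generated, fg_ref: (T, C, H, W); mask: (T, 1, H, W) in [0, 1]."""
    # Masked pixel L1, normalized by the masked area across channels.
    denom = (mask.sum() * generated.size(1)).clamp(min=1.0)
    pixel_term = (mask * (generated - fg_ref).abs()).sum() / denom
    if encoder is None:
        return pixel_term
    # Compare pooled embeddings of the masked regions; `encoder` is any
    # frozen feature extractor supplied by the caller.
    z_gen = encoder(generated * mask)
    z_ref = encoder(fg_ref * mask)
    feat_term = 1.0 - F.cosine_similarity(z_gen.flatten(1), z_ref.flatten(1)).mean()
    return pixel_term + w_feat * feat_term
```

With `encoder=None` this reduces to masked L1; supplying a frozen feature extractor would add a semantic identity-matching term on top of the pixel-level one.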