🤖 AI Summary
To address the challenges of coarse-grained control and the scarcity of high-quality labeled data in generative video synthesis, this paper proposes the Split-then-Merge framework. It first disentangles unlabeled videos into dynamic foreground and static background layers, then reconstructs them via self-supervised composition to establish a controllable generation pathway. The method introduces four key innovations: (1) a hierarchical self-composition mechanism, (2) a transformation-aware training procedure, (3) a multi-level fusion enhancement strategy, and (4) an identity-preserving loss, which together strengthen explicit modeling of motion semantics and the fidelity of foreground detail. Extensive evaluations on multiple quantitative benchmarks, together with human assessments and VLLM-based qualitative analysis, show that the approach consistently surpasses state-of-the-art methods, with significant improvements in visual realism, temporal coherence, and interaction plausibility.
📝 Abstract
We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and to address its data-scarcity problem. Unlike conventional methods that rely on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that uses multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show that StM outperforms state-of-the-art methods both on quantitative benchmarks and in human- and VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
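To make the split-then-merge idea concrete, here is a toy sketch of the self-supervised composition loop on a single grayscale frame: split a frame into foreground and background layers, apply a transformation to the foreground, merge it back, and measure an identity-preservation term. All function names, the translation-only transform, and the pixel-level loss are illustrative assumptions for this sketch, not the paper's actual pipeline (which operates on videos with learned generative models).

```python
import numpy as np

def split(frame, mask):
    """Split a frame into foreground and background layers via a binary mask."""
    return frame * mask, frame * (1 - mask)

def transform(layer, shift):
    """Toy transformation-aware augmentation: translate a layer by `shift` pixels."""
    return np.roll(layer, shift, axis=(0, 1))

def merge(fg, bg, mask, shift):
    """Self-compose: paste the transformed foreground back onto the background."""
    moved_mask = np.roll(mask, shift, axis=(0, 1))
    return fg * moved_mask + bg * (1 - moved_mask)

# Toy 8x8 frame with a bright 2x2 "subject" patch as the dynamic foreground.
frame = np.zeros((8, 8))
frame[2:4, 2:4] = 1.0
mask = (frame > 0).astype(float)

fg, bg = split(frame, mask)
shift = (1, 1)
composed = merge(transform(fg, shift), bg, mask, shift)

# Identity-preservation term (hypothetical): foreground content should survive
# the transformation intact at its new location.
identity_loss = np.abs(composed[3:5, 3:5] - frame[2:4, 2:4]).sum()
```

In the real framework this composition would feed a generative model trained to produce plausible blends, with the identity-preservation loss keeping the foreground faithful; the sketch only shows the data flow of split, transform, and merge.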