🤖 AI Summary
To address the challenges of coarse-grained control and the scarcity of high-quality labeled data in generative video synthesis, this paper proposes the Split-then-Merge framework. It first disentangles unlabeled videos into dynamic foreground and static background layers, then reconstructs them via self-supervised composition to establish a controllable generation pathway. The method introduces four key innovations: (1) a hierarchical self-composition mechanism, (2) a transformation-aware training procedure, (3) a multi-level fusion enhancement strategy, and (4) an identity-preserving loss, which together strengthen explicit modeling of motion semantics and the fidelity of foreground detail. Extensive evaluations on multiple quantitative benchmarks, together with human assessments and VLLM-based qualitative analysis, show that the approach consistently surpasses state-of-the-art methods, with significant improvements in visual realism, temporal coherence, and interaction plausibility.
📝 Abstract
We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and to address its data-scarcity problem. Unlike conventional methods that rely on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that uses multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show that StM outperforms state-of-the-art methods both on quantitative benchmarks and in human- and VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
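To make the split-then-merge idea concrete, here is a toy sketch of the self-supervised composition loop on a single grayscale frame: split a frame into foreground and background layers, apply a transformation to the foreground, merge it back, and measure an identity-preservation term. All function names, the translation-only transform, and the pixel-level loss are illustrative assumptions for this sketch, not the paper's actual pipeline (which operates on videos with learned generative models).

```python
import numpy as np

def split(frame, mask):
    """Split a frame into foreground and background layers via a binary mask."""
    return frame * mask, frame * (1 - mask)

def transform(layer, shift):
    """Toy transformation-aware augmentation: translate a layer by `shift` pixels."""
    return np.roll(layer, shift, axis=(0, 1))

def merge(fg, bg, mask, shift):
    """Self-compose: paste the transformed foreground back onto the background."""
    moved_mask = np.roll(mask, shift, axis=(0, 1))
    return fg * moved_mask + bg * (1 - moved_mask)

# Toy 8x8 frame with a bright 2x2 "subject" patch as the dynamic foreground.
frame = np.zeros((8, 8))
frame[2:4, 2:4] = 1.0
mask = (frame > 0).astype(float)

fg, bg = split(frame, mask)
shift = (1, 1)
composed = merge(transform(fg, shift), bg, mask, shift)

# Identity-preservation term (hypothetical): foreground content should survive
# the transformation intact at its new location.
identity_loss = np.abs(composed[3:5, 3:5] - frame[2:4, 2:4]).sum()
```

In the real framework this composition would feed a generative model trained to produce plausible blends, with the identity-preservation loss keeping the foreground faithful; the sketch only shows the data flow of split, transform, and merge.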