🤖 AI Summary
Existing automatic video mashup methods coordinate poorly across semantic, visual, and auditory dimensions and across editing levels, which often produces jarring visual transitions and footage that is misaligned with the music, falling short of professional-grade fluency. This work formulates video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and introduces DIRECT, a three-tier multi-agent framework modeled on a cinematic production pipeline. The framework automates the whole process: a Screenwriter anchors the global structure, a Director generates editing intent, and an Editor performs fine-grained shot optimization. Key technical contributions include hierarchical multi-agent planning, intent-guided shot editing, source-aware structural modeling, and multimodal alignment optimization. The authors also establish Mashup-Bench, a comprehensive benchmark for video mashup evaluation. Experiments show that the proposed method significantly outperforms state-of-the-art approaches in both objective metrics and human subjective assessments, improving visual coherence and audio-visual synchronization.
📝 Abstract
Video mashup creation is a complex video editing paradigm that recomposes existing footage into engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions at multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration needed to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascaded levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT
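To make the three-level cascade in the abstract concrete, here is a toy Python sketch of a Screenwriter, Director, and Editor passing information down a pipeline. It is not the DIRECT implementation: every name in it (`Shot`, `Section`, `make_mashup`), the `motion` and `energy` scores, the shot-length rule, and the greedy matching are assumptions made purely for illustration, and no agents or learned models are involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    shot_id: str
    duration: float  # seconds
    motion: float    # hypothetical motion-intensity score in [0, 1]


@dataclass(frozen=True)
class Section:
    name: str
    length: float    # seconds of music to cover
    energy: float    # hypothetical music-energy score in [0, 1]


def screenwriter(music_energy, section_length):
    """Level 1 (toy): anchor the global structure as fixed-length music sections."""
    return [Section(f"section_{i}", section_length, e)
            for i, e in enumerate(music_energy)]


def director(section):
    """Level 2 (toy): turn a section into an editing intent.

    Assumed rule: energetic music asks for high-motion, shorter shots.
    """
    return {"target_motion": section.energy,
            "max_shot_len": 4.0 - 2.0 * section.energy}


def editor(section, intent, pool):
    """Level 3 (toy): greedily pick the shots closest to the intent
    until the section's duration is filled."""
    candidates = [s for s in pool if s.duration <= intent["max_shot_len"]]
    candidates.sort(key=lambda s: abs(s.motion - intent["target_motion"]))
    chosen, remaining = [], section.length
    for shot in candidates:
        if shot.duration <= remaining + 1e-9:
            chosen.append(shot)
            remaining -= shot.duration
    return chosen


def make_mashup(music_energy, section_length, shots):
    """Run the three levels in cascade; each shot is used at most once."""
    pool = list(shots)
    timeline = []
    for section in screenwriter(music_energy, section_length):
        picked = editor(section, director(section), pool)
        for shot in picked:
            pool.remove(shot)
        timeline.append((section.name, [s.shot_id for s in picked]))
    return timeline


shots = [Shot("a", 2.0, 0.1), Shot("b", 2.0, 0.9),
         Shot("c", 2.0, 0.2), Shot("d", 2.0, 0.8)]
timeline = make_mashup([0.0, 1.0], 4.0, shots)
print(timeline)
```

In this toy run the calm first section receives the two low-motion shots and the energetic second section receives the two high-motion shots. The point is only the direction of information flow: global structure constrains intent, and intent constrains shot selection.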