DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

📅 2026-04-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing automatic video mashup methods coordinate poorly across semantic, visual, and auditory dimensions and across editing levels, which often produces jarring visual transitions and misaligned audio and keeps results short of professional-grade fluency. This work formulates video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and introduces DIRECT, the first three-tier multi-agent framework inspired by cinematic production pipelines. DIRECT automates the entire process, from global structure anchoring (the Screenwriter) through editing intent generation (the Director) to fine-grained shot optimization (the Editor). Key technical contributions include hierarchical multi-agent planning, intent-guided shot editing, source-aware structural modeling, and multimodal alignment optimization. The authors also establish Mashup-Bench, the first comprehensive benchmark for video mashup evaluation, with tailored metrics for visual continuity and auditory alignment. Experiments show that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation, improving visual coherence and audio-visual synchronization.

📝 Abstract

Video mashup creation is a complex video editing paradigm that recomposes existing footage into engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions and across multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration needed to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascaded levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT
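To make the three-level cascade concrete, the following is a toy sketch of how a Screenwriter, Director, and Editor stage might hand results to one another. It is not the DIRECT implementation: the abstract does not specify any interfaces, so every name here (`Shot`, `Section`, `Intent`, `screenwriter`, `director`, `editor`, `make_mashup`) is hypothetical, the agents are replaced by plain rule-based functions, and the "intent" is reduced to a single target shot length tied to the music beat.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Shot:
    clip_id: str
    duration: float  # usable length in seconds
    tag: str         # coarse semantic label (hypothetical)


@dataclass
class Section:
    start: float     # seconds on the music timeline
    end: float
    theme: str       # semantic label wanted in this section


@dataclass
class Intent:
    section: Section
    shot_len: float  # target shot duration in seconds


def screenwriter(boundaries: List[float], themes: List[str]) -> List[Section]:
    """Level 1 (assumed): anchor the global structure to music section boundaries."""
    spans = zip(boundaries[:-1], boundaries[1:])
    return [Section(s, e, t) for (s, e), t in zip(spans, themes)]


def director(sections: List[Section], beat: float,
             beats_per_shot: int = 2) -> List[Intent]:
    """Level 2 (assumed): turn each section into an editing intent, here a cut rhythm."""
    return [Intent(sec, beat * beats_per_shot) for sec in sections]


def editor(intents: List[Intent], shots: List[Shot]) -> List[Tuple[float, str, float]]:
    """Level 3 (assumed): greedily fill each section with matching shots."""
    timeline = []
    used = set()
    for intent in intents:
        t = intent.section.start
        for shot in shots:
            if t >= intent.section.end:
                break  # section is full
            if shot.clip_id in used or shot.tag != intent.section.theme:
                continue  # already placed, or wrong semantic label
            if shot.duration < intent.shot_len:
                continue  # too short to hold until the next cut
            length = min(intent.shot_len, intent.section.end - t)
            timeline.append((t, shot.clip_id, length))
            used.add(shot.clip_id)
            t += length
    return timeline


def make_mashup(boundaries, themes, beat, shots):
    """Run the three levels in cascade: structure, then intent, then shot editing."""
    return editor(director(screenwriter(boundaries, themes), beat), shots)


if __name__ == "__main__":
    demo_shots = [
        Shot("a", 3.0, "calm"), Shot("b", 0.5, "calm"), Shot("c", 2.0, "action"),
        Shot("d", 1.5, "calm"), Shot("e", 4.0, "action"),
    ]
    for entry in make_mashup([0.0, 2.0, 4.0], ["calm", "action"], 0.5, demo_shots):
        print(entry)
```

The point of the sketch is only the direction of information flow: each level narrows the decisions left to the next, so the final stage works on individual shots inside constraints set higher up. A real system would presumably replace the greedy tag match with the multimodal scoring and optimization the abstract alludes to.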

Problem

Research questions and friction points this paper is trying to address.

video mashup

multimodal coherency

automated video editing

visual continuity

auditory alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical multi-agent planning

intent-guided editing

multimodal coherency