DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automatic video mashup methods coordinate poorly across semantic, visual, and auditory dimensions and across editing levels, which often produces jarring visual transitions and misaligned audio and falls short of professional-grade fluency. This work formulates video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and introduces DIRECT, the first three-tier multi-agent framework, inspired by cinematic production pipelines, that automates the entire process from global structure anchoring, through editing intent generation, to fine-grained shot optimization. Key technical contributions include hierarchical multi-agent planning, intent-guided shot editing, source-aware structural modeling, and multimodal alignment optimization. The authors also establish Mashup-Bench, the first comprehensive benchmark for video mashup evaluation. Experiments show that the proposed method significantly outperforms state-of-the-art approaches in both objective metrics and human subjective assessments, improving visual coherence and audio-visual synchronization.
📝 Abstract
Video mashup creation is a complex video editing paradigm that recomposes existing footage into engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions and across multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration needed to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascaded levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT
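The three-level cascade described in the abstract can be pictured as a chain of functions, each consuming the previous level's output. The sketch below is a toy illustration only, not the authors' implementation: every type, function name, parameter, and heuristic in it (`Section`, `Intent`, `Shot`, even splitting of the track, the `fast_cap` limit, greedy tag matching) is a hypothetical placeholder chosen to show the data flow from Screenwriter to Director to Editor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    """Hypothetical global structure unit: a span of the music track with a theme."""
    start: float
    end: float
    theme: str


@dataclass
class Intent:
    """Hypothetical editing intent: a section plus a cap on individual shot length."""
    section: Section
    max_shot_len: float


@dataclass
class Shot:
    """Hypothetical source shot with a duration in seconds and a theme tag."""
    shot_id: str
    duration: float
    tag: str


def screenwriter(track_length: float, themes: List[str]) -> List[Section]:
    # Level 1 (placeholder logic): split the track evenly, one section per theme.
    step = track_length / len(themes)
    return [Section(i * step, (i + 1) * step, t) for i, t in enumerate(themes)]


def director(sections: List[Section], fast_themes=frozenset({"climax"}),
             fast_cap: float = 5.0) -> List[Intent]:
    # Level 2 (placeholder logic): sections with a "fast" theme get a shot-length cap;
    # other sections allow shots as long as the section itself.
    intents = []
    for s in sections:
        length = s.end - s.start
        cap = min(fast_cap, length) if s.theme in fast_themes else length
        intents.append(Intent(s, cap))
    return intents


def editor(intents: List[Intent], shots: List[Shot]) -> List[Shot]:
    # Level 3 (placeholder logic): per intent, greedily take shots whose tag matches
    # the section theme and whose duration fits both the cap and the remaining time.
    timeline = []
    for intent in intents:
        budget = intent.section.end - intent.section.start
        for shot in shots:
            if shot.tag == intent.section.theme and shot.duration <= min(budget, intent.max_shot_len):
                timeline.append(shot)
                budget -= shot.duration
    return timeline


def make_mashup(track_length: float, themes: List[str], shots: List[Shot]) -> List[Shot]:
    # The cascade: global structure -> editing intent -> shot sequence.
    return editor(director(screenwriter(track_length, themes)), shots)
```

Calling `make_mashup(20.0, ["intro", "climax"], shots)` with a small list of tagged shots returns an ordered timeline. The point of the layering is that each level narrows the search space for the next, so the final shot selection only has to satisfy local constraints.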
Problem

Research questions and friction points this paper is trying to address.

video mashup
multimodal coherency
automated video editing
visual continuity
auditory alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical multi-agent planning
intent-guided editing
multimodal coherency
video mashup generation
Mashup-Bench
Ke Li
School of Electronics Engineering and Computer Science, Peking University
Maoliang Li
School of Computer Science, Peking University
Jialiang Chen
School of Electronics Engineering and Computer Science, Peking University
Jiayu Chen
PhD student, IFLab@PKU
Efficient Visual Generation, ML System
Zihao Zheng
Peking University
Machine Learning System, Edge Computing, Computer Architecture, EDA
Shaoqi Wang
University of Colorado, Colorado Springs
Big Data Analytics, Distributed Machine Learning, Deep Learning
Xiang Chen
School of Computer Science, Peking University