GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

📅 2026-04-06

🤖 AI Summary

This work addresses music-driven non-linear video editing, where rhythm alignment, user intent, narrative coherence, and long-range structural consistency must be optimized jointly. It proposes GLANCE, a global-local coordination multi-agent framework in which an outer loop performs long-range structural planning and an inner loop edits each segment through an "Observe-Think-Act-Verify" cycle. To handle conflicts that arise across segments, the approach introduces a context controller, a conflict-region decomposition module, and a bottom-up dynamic negotiation mechanism. The authors also construct MVEBench, a new benchmark for this task, along with an agent-as-a-judge evaluation protocol. With GPT-4o-mini as the shared backbone, GLANCE improves over the strongest baseline by 33.2% and 15.6% on two task settings, respectively, and human evaluation confirms both the quality of the generated videos and the effectiveness of the proposed evaluation framework.

📝 Abstract

Music-grounded mashup video creation is a challenging form of non-linear video editing, where a system must compose a coherent timeline from large collections of source videos while aligning with music rhythm, user intent, story completeness, and long-range structural constraints. Existing approaches typically rely on fixed pipelines or simplified retrieval-and-concatenation paradigms, limiting their ability to adapt to diverse prompts and heterogeneous source materials. In this paper, we present GLANCE, a global-local coordination multi-agent framework for music-grounded non-linear video editing. GLANCE adopts a bi-loop architecture: an outer loop performs long-horizon planning and task-graph construction, and an inner loop follows an "Observe-Think-Act-Verify" flow for segment-wise editing tasks and their refinement. To address the cross-segment and global conflicts that emerge after sub-timelines are composed, we introduce a dedicated global-local coordination mechanism with both preventive and corrective components, comprising a novel context controller, a conflict region decomposition module, and a bottom-up dynamic negotiation mechanism. To support rigorous evaluation, we construct MVEBench, a new benchmark that factorizes editing difficulty along task type, prompt specificity, and music length, and propose an agent-as-a-judge evaluation framework for scalable multi-dimensional assessment. Experimental results show that GLANCE consistently outperforms prior research baselines and open-source product baselines under the same backbone models. With GPT-4o-mini as the backbone, GLANCE improves over the strongest baseline by 33.2% and 15.6% on two task settings, respectively. Human evaluation further confirms the quality of the generated videos and validates the effectiveness of the proposed evaluation framework.
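The bi-loop control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`SegmentTask`, `plan_segments`, `edit_segment`, `run`) is hypothetical, the outer loop is reduced to fixed-length segmentation instead of real task-graph construction, and the `propose` and `verify` callables stand in for the LLM-backed agents. The global-local coordination components (context controller, conflict region decomposition, negotiation) are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Observation = Tuple[float, float, int]  # (segment start, segment end, attempt index)


@dataclass
class SegmentTask:
    """Hypothetical task node: a span of the music to be filled with clips."""
    start: float  # seconds into the music
    end: float
    clips: List[str] = field(default_factory=list)


def plan_segments(music_length: float, segment_length: float) -> List[SegmentTask]:
    """Outer-loop stand-in: split the music into fixed-length segment tasks."""
    tasks = []
    t = 0.0
    while t < music_length:
        tasks.append(SegmentTask(t, min(t + segment_length, music_length)))
        t += segment_length
    return tasks


def edit_segment(
    task: SegmentTask,
    propose: Callable[[Observation], List[str]],
    verify: Callable[[SegmentTask], bool],
    max_rounds: int = 3,
) -> bool:
    """Inner loop: observe, think, act, verify; retry until verified or out of rounds."""
    for attempt in range(max_rounds):
        observation = (task.start, task.end, attempt)  # observe the current state
        candidate = propose(observation)               # think: choose candidate clips
        task.clips = candidate                         # act: write them to the sub-timeline
        if verify(task):                               # verify: accept or refine again
            return True
    return False


def run(
    music_length: float,
    segment_length: float,
    propose: Callable[[Observation], List[str]],
    verify: Callable[[SegmentTask], bool],
) -> List[SegmentTask]:
    """Plan all segments once, then edit each one with the inner loop."""
    tasks = plan_segments(music_length, segment_length)
    for task in tasks:
        edit_segment(task, propose, verify)
    return tasks
```

In this sketch a failed verification simply triggers another proposal for the same segment; a fuller system would feed the verifier's feedback into the next round and reconcile conflicts between neighboring segments after composition.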

Problem

Research questions and friction points this paper is trying to address.

music-grounded video editing

non-linear video editing

multi-agent framework

global-local coordination

video mashup

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent framework

global-local coordination

non-linear video editing