AI Summary
Music-to-video (M2V) generation faces core challenges including structural fragmentation in full-length videos, inaccurate beat/lyric alignment, and temporal incoherence. To address these, we propose the first end-to-end framework for generating full-length music videos, built upon a multi-agent collaborative architecture that integrates music parsing, lyric temporal alignment, LLM-driven scriptwriting and directorial scheduling, and a closed-loop verification and feedback mechanism. We further introduce the first domain-specific, four-dimensional, twelve-metric evaluation benchmark for M2V, spanning signal processing, multimodal generation fidelity, and expert human assessment. Extensive experiments demonstrate that our method consistently outperforms commercial baselines across all four key dimensions: structural integrity, rhythmic consistency, semantic alignment, and visual quality. Notably, it generates minute-long MVs approaching professional human production standards.
Abstract
Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lacking temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and provides these features as contextual input for the subsequent agents. The Screenwriter Agent and Director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call the image generator for keyframes and different video generators for "story" or "singer" scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent long-form MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria. This benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters: AutoMV significantly outperforms current baselines across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human experts, highlighting room for future work.
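The agent pipeline described in the abstract (music parsing → screenwriter → director → generation → verification) can be sketched as follows. This is a minimal illustration only: every class and function name here is hypothetical, and the real system's agents are LLM-driven rather than the rule-based placeholders shown.

```python
# Hypothetical sketch of the AutoMV multi-agent pipeline.
# All names are illustrative; the paper does not specify this API.
from dataclasses import dataclass, field

@dataclass
class MusicContext:
    """Features extracted by the music-processing tools."""
    structure: list      # e.g. ["intro", "verse", "chorus"]
    vocal_track: str     # path to the separated vocal stem
    lyrics: list         # (start_time, end_time, line) triples

@dataclass
class CharacterBank:
    """Shared external bank of character profiles."""
    profiles: dict = field(default_factory=dict)

def screenwriter_agent(ctx, bank):
    # Draft one short scene per song section; register characters
    # in the shared bank so later shots stay visually consistent.
    bank.profiles.setdefault("lead_singer", {"appearance": "placeholder"})
    return [{"section": s, "scene": f"scene for {s}"} for s in ctx.structure]

def director_agent(script):
    # Attach camera instructions and route each shot to a
    # "singer" or "story" video generator.
    return [dict(shot, camera="medium shot",
                 kind="singer" if shot["section"] == "chorus" else "story")
            for shot in script]

def verifier_agent(clips):
    # Placeholder quality gate; the real Verifier Agent scores
    # generated clips and can trigger regeneration.
    return all(clip is not None for clip in clips)

def automv(ctx):
    bank = CharacterBank()
    shots = director_agent(screenwriter_agent(ctx, bank))
    # Keyframe image generation and per-shot video generation
    # would be invoked here; we pass the shot specs through.
    clips = list(shots)
    assert verifier_agent(clips)
    return clips
```

The shared `CharacterBank` stands in for the paper's external character-profile store, which keeps appearances consistent across independently generated shots.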