🤖 AI Summary
Long-video audio-visual synchronization faces several challenges: dynamic semantic evolution, temporal misalignment, and the absence of dedicated benchmarks, all of which hinder cross-scenario consistency and long-duration alignment. To address these, we propose LVAS-Agent, the first multi-agent collaborative framework for long-video audio synthesis, inspired by professional dubbing workflows. It coordinates scene segmentation, script generation, sound-effect design, and hierarchical audio synthesis end to end. It introduces a “discussion-correction” mechanism to enhance scene-script consistency and a “generation-retrieval” loop to strengthen cross-modal semantic and temporal alignment. We further release LVAS-Bench, the first benchmark dedicated to long-video audio-visual synthesis. Experiments demonstrate that LVAS-Agent substantially outperforms existing methods in alignment fidelity, cross-scenario coherence, and long-term temporal stability.
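The paper itself ships no code here; as a rough illustration of the workflow described above, the Python sketch below shows how a segmentation-then-scripting pipeline with a discussion-correction round might be wired together. Every name in it (`Scene`, `draft_script`, `critique`, the placeholder return values) is hypothetical, not LVAS-Agent's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    start: float        # scene start time in seconds
    end: float          # scene end time in seconds
    script: str = ""    # sound script agreed on for this scene

def segment_scenes(video_path: str) -> list[Scene]:
    """Placeholder: a vision agent would split the long video into scenes here."""
    return [Scene(0.0, 12.5), Scene(12.5, 31.0)]

def draft_script(scene: Scene, note: str | None = None) -> str:
    """Placeholder: a script agent drafts (or, given a critique note, revises)
    a textual description of the sound events in the scene."""
    return "footsteps on gravel, distant traffic"

def critique(scene: Scene, script: str) -> str | None:
    """Placeholder: a reviewer agent checks scene-script consistency and
    returns a correction note, or None when it is satisfied."""
    return None

def discussion_correction(scene: Scene, max_rounds: int = 3) -> Scene:
    """Iterate draft -> critique -> revise until the reviewer accepts the script."""
    script = draft_script(scene)
    for _ in range(max_rounds):
        note = critique(scene, script)
        if note is None:                      # no inconsistency found, stop discussing
            break
        script = draft_script(scene, note)    # revise the draft using the critique
    scene.script = script
    return scene

def run_pipeline(video_path: str) -> list[Scene]:
    """Segmentation, then scripting with discussion-correction; sound design
    and audio synthesis would consume the resulting scenes downstream."""
    return [discussion_correction(s) for s in segment_scenes(video_path)]

print(run_pipeline("movie.mp4"))
```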
📝 Abstract
Video-to-audio synthesis, which generates synchronized audio for visual content, critically enhances viewer immersion and narrative coherence in film and interactive media. However, video-to-audio dubbing for long-form content remains an unsolved challenge due to dynamic semantic shifts, temporal misalignment, and the absence of dedicated datasets. While existing methods excel on short videos, they falter in long scenarios (e.g., movies) due to fragmented synthesis and inadequate cross-scene consistency. We propose LVAS-Agent, a novel multi-agent framework that emulates professional dubbing workflows through collaborative role specialization. Our approach decomposes long-video synthesis into four steps: scene segmentation, script generation, sound design, and audio synthesis. Central innovations include a discussion-correction mechanism for scene and script refinement and a generation-retrieval loop for temporal-semantic alignment. To enable systematic evaluation, we introduce LVAS-Bench, the first benchmark for this task, comprising 207 professionally curated long videos spanning diverse scenarios. Experiments demonstrate superior audio-visual alignment over baseline methods.
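For the generation-retrieval loop, one plausible reading (not confirmed by the abstract beyond the name) is: generate audio for each scripted event, score its cross-modal alignment against the video segment, and fall back to retrieving a clip from a sound-effect library when generation repeatedly scores below a threshold. The sketch below encodes that reading; `generate_audio`, `alignment_score`, `retrieve_audio`, and the threshold value are all stand-ins.

```python
import random

def generate_audio(prompt: str, duration: float) -> bytes:
    """Placeholder for a text-to-audio generator call."""
    return b"\x00" * int(duration * 16000)   # fake 16 kHz mono waveform

def retrieve_audio(prompt: str, duration: float) -> bytes:
    """Placeholder: nearest-neighbour lookup in a sound-effect library."""
    return b"\x01" * int(duration * 16000)

def alignment_score(audio: bytes, video_segment: str) -> float:
    """Placeholder for a cross-modal scorer (e.g., CLAP-style similarity)."""
    return random.random()

def generation_retrieval_loop(prompt: str, video_segment: str, duration: float,
                              threshold: float = 0.6, max_tries: int = 2) -> bytes:
    """Prefer generated audio; fall back to retrieval when alignment stays low."""
    best, best_score = b"", -1.0
    for _ in range(max_tries):
        audio = generate_audio(prompt, duration)
        score = alignment_score(audio, video_segment)
        if score >= threshold:          # generation is well aligned, accept it
            return audio
        if score > best_score:
            best, best_score = audio, score
    # Generation never crossed the threshold: try the library and keep the
    # higher-scoring candidate overall.
    retrieved = retrieve_audio(prompt, duration)
    if alignment_score(retrieved, video_segment) > best_score:
        return retrieved
    return best

clip = generation_retrieval_loop("footsteps on gravel", "scene_01", duration=3.0)
```

A real system would still need to place the chosen clip on the timeline and mix overlapping events; in the paper's terms, that is presumably the job of the hierarchical audio-synthesis step.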