🤖 AI Summary
This work addresses the lack of standardized evaluation for minute-long audio-visual generation under diverse conditioning inputs such as text, image, and video prompts, a gap left by existing benchmarks that focus primarily on short clips. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for evaluating long-form audio-visual synthesis, covering text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV) tasks. We construct a test set of 284 curated cases organized by application scenario and generation complexity, and devise a fine-grained evaluation framework spanning more than 20 dimensions, including within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. The assessment pipeline combines MLLM-assisted evaluation with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind, to integrate semantic, identity, visual, and acoustic features. Experiments on 11 representative models, validated by human evaluation, demonstrate the benchmark’s efficacy and expose critical limitations of current methods in sustaining temporal consistency and cross-modal alignment over extended durations.
📝 Abstract
Audio-visual generation is rapidly advancing from short clips to minute-long content, yet evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5-10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. It combines taxonomy-guided test-case construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.
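The abstract does not give the exact formulas behind its cross-segment consistency dimensions. As a rough illustration only, the sketch below assumes each segment of a generated clip has already been reduced to one embedding vector (for example from DINO-v2 or ArcFace) and scores consistency with cosine similarity. The function names `adjacent_consistency` and `drift_from_first` are hypothetical and are not taken from LongAV-Compass.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_u * norm_v)


def adjacent_consistency(segments: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between consecutive segment embeddings.

    Assumption: one embedding per segment; this is an illustrative score,
    not the benchmark's actual metric.
    """
    if len(segments) < 2:
        raise ValueError("need at least two segments")
    sims = [cosine(a, b) for a, b in zip(segments, segments[1:])]
    return sum(sims) / len(sims)


def drift_from_first(segments: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of every later segment to the first one.

    A downward trend suggests identity or appearance drift over time.
    """
    if len(segments) < 2:
        raise ValueError("need at least two segments")
    first = segments[0]
    return [cosine(first, e) for e in segments[1:]]


if __name__ == "__main__":
    # Toy 2-D embeddings: the third segment is orthogonal to the first two.
    toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(adjacent_consistency(toy))
    print(drift_from_first(toy))
```

Comparing each segment to the first, rather than only to its neighbor, is one way to surface slow drift that adjacent-pair scores can miss, which is the kind of degradation over long horizons the abstract highlights.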