Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model

📅 2025-03-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limitations in audio-visual multimodal foundation model training, namely data redundancy, low quality, and coarse-grained alignment that hinder retrieval performance, this paper proposes AVVA, a framework built around an LLM-driven joint audio-visual quality assessment and semantic alignment paradigm. AVVA moves beyond timestamp-dependent coarse synchronization and enables fine-grained cross-modal content matching without textual supervision. Methodologically, it integrates Whisper and DINOv2 encoders within a dual-encoder contrastive learning framework, augmented by an LLM-based scoring-and-filtering pipeline. Experiments show that, trained on only 192 hours of curated data, AVVA achieves a 7.6% improvement in top-1 audio-to-video retrieval accuracy on VGGSound over ImageBind, which was trained on more than 5,800 hours. Top-3 accuracy improves over uncurated baselines by 47.8, 48.4, and 58.0 percentage points on AudioCaps, VALOR, and VGGSound, respectively, empirically supporting the principle that quality trumps quantity in multimodal pretraining.
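The scoring-and-filtering step of the curation pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the clip fields, the threshold value, and the `mock_llm_score` stand-in (which replaces an actual LLM judging audio-visual agreement) are all assumptions for the sake of a runnable example.

```python
# Sketch of an LLM-based scoring-and-filtering curation step in the
# spirit of AVVA. The scoring function, threshold, and clip fields are
# illustrative assumptions, not the paper's exact design.

def curate_clips(clips, score_fn, threshold=0.5):
    """Keep only clips whose audio-visual match score meets the
    threshold; return them sorted best-first."""
    kept = [c for c in clips if score_fn(c) >= threshold]
    kept.sort(key=score_fn, reverse=True)
    return kept

def mock_llm_score(clip):
    # Stand-in for an LLM rating how well the audio and visual
    # descriptions of a clip agree (word overlap as a crude proxy).
    audio = set(clip["audio_desc"].split())
    video = set(clip["video_desc"].split())
    return len(audio & video) / max(len(audio), 1)

clips = [
    {"id": 1, "audio_desc": "dog barking loudly",
     "video_desc": "dog barking in yard"},          # well matched
    {"id": 2, "audio_desc": "crowd cheering",
     "video_desc": "empty street at night"},        # mismatched
]
curated = curate_clips(clips, mock_llm_score)
```

Only the well-matched clip survives filtering; the mismatched clip is discarded, mirroring how low-quality or misaligned training data is dropped before pretraining.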

📝 Abstract
Integrating audio and visual data for training multimodal foundational models remains challenging. We present Audio-Video Vector Alignment (AVVA), which aligns audiovisual (AV) scene content beyond mere temporal synchronization via a Large Language Model (LLM)-based data curation pipeline. Specifically, AVVA scores and selects high-quality training clips using Whisper (speech-based audio foundation model) for audio and DINOv2 for video within a dual-encoder contrastive learning framework. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate that this approach can achieve significant accuracy gains with substantially less curated data. For instance, AVVA yields a 7.6% improvement in top-1 accuracy for audio-to-video retrieval on VGGSound compared to ImageBind, despite training on only 192 hours of carefully filtered data (vs. 5800+ hours). Moreover, an ablation study highlights that trading data quantity for data quality improves performance, yielding respective top-3 accuracy increases of 47.8, 48.4, and 58.0 percentage points on AudioCaps, VALOR, and VGGSound over uncurated baselines. While these results underscore AVVA's data efficiency, we also discuss the overhead of LLM-driven curation and how it may be scaled or approximated in larger domains. Overall, AVVA provides a viable path toward more robust, text-free audiovisual learning with improved retrieval accuracy.
Problem

Research questions and friction points this paper is trying to address.

How to align audiovisual scene content beyond mere temporal synchronization.
How to retain retrieval accuracy while training on far less, carefully curated data.
Whether quality-focused curation can make multimodal foundation model training data-efficient.
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based data curation for quality training clips
Dual-encoder contrastive learning with Whisper and DINOv2
Higher retrieval accuracy from substantially less, quality-filtered training data
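The dual-encoder contrastive setup listed above can be sketched as a symmetric, CLIP-style InfoNCE objective over an audio-video similarity matrix. This is a minimal NumPy illustration with random stand-in embeddings; in AVVA the audio and video features would come from Whisper and DINOv2, and the temperature value here is an assumption.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric cross-entropy over the audio-video similarity
    matrix; matched pairs sit on the diagonal."""
    a = l2_normalize(audio_emb)
    v = l2_normalize(video_emb)
    logits = a @ v.T / temperature  # (N, N) pairwise similarities

    def xent_diag(l):
        # mean negative log-softmax of the diagonal (matched pairs)
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
video = audio + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs
loss_aligned = contrastive_loss(audio, video)
loss_random = contrastive_loss(audio, rng.normal(size=(4, 8)))
```

As expected, near-aligned audio-video pairs yield a much lower loss than random pairings, which is the signal that pulls matched clips together in the shared embedding space.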