Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of non-speech sound and music understanding and reasoning in long-duration audio (30 seconds to 5 minutes). We propose the first end-to-end long-audio–language model, integrating a customized CLAP-based audio encoder, synthetically generated audio question-answering data, multi-stage curriculum learning, and a segment-wise attention fusion mechanism. We introduce LongAudio, the first dedicated long-audio dataset, and LongAudioBench, a comprehensive evaluation benchmark. Our 3-billion-parameter model achieves state-of-the-art performance across 20+ public benchmarks and significantly outperforms existing audio–language models on LongAudioBench. Ablation studies confirm the critical contribution of each component. This work establishes the first systematic framework for fine-grained semantic understanding and reasoning over minute-scale audio, setting a new baseline for long-audio AI.

📝 Abstract
Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. Project Website: https://research.nvidia.com/labs/adlr/AF2/.
Problem

Research questions and friction points this paper is trying to address.

Develops Audio Flamingo 2 for advanced audio understanding and reasoning.
Extends audio understanding to long audio segments up to 5 minutes.
Introduces LongAudio dataset for training on long audio tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Custom CLAP model for audio understanding
Synthetic Audio QA data for fine-grained reasoning
Multi-stage curriculum learning strategy
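The summary mentions a segment-wise attention fusion mechanism for handling minute-scale audio. The paper's actual architecture is not reproduced here; the following is a minimal, hypothetical sketch of the general idea: split a long waveform into fixed-length segments, encode each with a CLAP-style encoder (a toy projection stands in for the real model), and fuse the segment embeddings with query-conditioned attention. All names, dimensions, and the segment length are illustrative assumptions, not values from the paper.

```python
import numpy as np

SR = 16_000       # assumed sample rate (Hz)
SEG_SECONDS = 10  # illustrative segment length; the paper's windowing may differ
EMB_DIM = 64      # toy embedding size; a real CLAP encoder is far larger

rng = np.random.default_rng(0)

def encode_segment(wave: np.ndarray) -> np.ndarray:
    """Stand-in for a CLAP-style audio encoder: maps a waveform segment
    to a fixed-size embedding. A real model would be a neural network."""
    proj = rng.standard_normal((wave.shape[0], EMB_DIM)) / np.sqrt(wave.shape[0])
    return wave @ proj

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attention_fuse(seg_embs: np.ndarray, query: np.ndarray):
    """Fuse per-segment embeddings into one vector via scaled dot-product
    attention against a query (e.g. a text/question embedding)."""
    scores = seg_embs @ query / np.sqrt(EMB_DIM)
    weights = softmax(scores)
    return weights @ seg_embs, weights

# 45 seconds of fake audio -> four full 10 s segments (short tail dropped here).
audio = rng.standard_normal(SR * 45)
n = SR * SEG_SECONDS
segments = [audio[i:i + n] for i in range(0, len(audio) - n + 1, n)]
seg_embs = np.stack([encode_segment(s) for s in segments])

query = rng.standard_normal(EMB_DIM)      # hypothetical question embedding
fused, weights = attention_fuse(seg_embs, query)
print(len(segments), fused.shape, round(weights.sum(), 6))
```

The key property this sketch illustrates is that compute grows linearly with audio length (one encoder pass per segment) while the fused representation stays fixed-size, which is what makes 30-second-to-5-minute inputs tractable for a downstream language model.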