🤖 AI Summary
This work addresses the lack of fully open-source, state-of-the-art large audio-language models (Audio-LLMs) capable of unified understanding and multimodal reasoning across speech, environmental sounds, and music. To this end, the authors propose Audio Flamingo 3 (AF3), a novel Audio-LLM featuring: (1) a unified audio encoder (AF-Whisper) jointly trained on speech, sound, and music; (2) flexible, on-demand chain-of-thought-style reasoning; (3) multi-turn, multi-audio dialogue; (4) long-audio understanding up to 10 minutes; and (5) voice-to-voice interaction. Training employs a five-stage curriculum learning strategy on newly curated large-scale datasets: AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat. Evaluated on more than 20 audio understanding and reasoning benchmarks, AF3 sets new state-of-the-art results among open-source models and surpasses several larger closed-source counterparts on key metrics, significantly advancing open research in audio intelligence.
📝 Abstract
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained with a novel strategy for joint representation learning across all three modalities: speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to perform chain-of-thought-style reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained only on open-source audio data, AF3 achieves new SOTA results on over 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
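Since AF3 is released fully open, inference presumably follows the standard Audio-LLM pattern: encode the waveform with the audio encoder, interleave it with a text prompt, and decode autoregressively. Below is a minimal sketch of that pattern, assuming a Hugging Face-style checkpoint and processor; the model id, argument names, and `trust_remote_code` loading path are illustrative assumptions, not the confirmed AF3 interface (consult the official release for the actual loading code).

```python
# Illustrative audio-LLM inference sketch; the model id and the
# processor call signature below are assumptions for demonstration.
import librosa
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "nvidia/audio-flamingo-3"  # hypothetical identifier

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Whisper-style encoders typically expect 16 kHz mono input.
audio, sr = librosa.load("clip.wav", sr=16000, mono=True)

prompt = "What instruments are playing, and what is the speaker saying?"
inputs = processor(
    text=prompt, audios=audio, sampling_rate=sr, return_tensors="pt"
).to(model.device)

# Generate a free-form answer conditioned on the audio and the prompt.
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

For the on-demand thinking mode described above, one would expect the prompt (or a dedicated control token) to request chain-of-thought output before the final answer, in line with the AF-Think training data.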