🤖 AI Summary
This work addresses the lack of fully open-source, state-of-the-art large audio-language models (Audio-LLMs) capable of unified understanding and multimodal reasoning across speech, environmental sounds, and music. To this end, the authors propose Audio Flamingo 3 (AF3), a novel Audio-LLM featuring: (1) a unified audio encoder (AF-Whisper) jointly trained on speech, sound, and music; (2) flexible, on-demand chain-of-thought-style reasoning; (3) multi-turn, multi-audio dialogue; (4) long-audio understanding up to 10 minutes; and (5) voice-to-voice interaction. Training employs a five-stage curriculum learning strategy on newly curated large-scale datasets: AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat. Evaluated on more than 20 audio understanding and reasoning benchmarks, AF3 sets new state-of-the-art results among open-source models and surpasses several larger closed-source counterparts on key metrics, significantly advancing open research in audio intelligence.
📝 Abstract
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained with a novel strategy for joint representation learning across all three modalities: speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to perform chain-of-thought-style reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained only on open-source audio data, AF3 achieves new SOTA results on over 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
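Since AF3 is released fully open, inference presumably follows the standard Audio-LLM pattern: encode the waveform with the audio encoder, interleave it with a text prompt, and decode autoregressively. Below is a minimal sketch of that pattern, assuming a Hugging Face-style checkpoint and processor; the model id, argument names, and `trust_remote_code` loading path are illustrative assumptions, not the confirmed AF3 interface (consult the official release for the actual loading code).

```python
# Illustrative audio-LLM inference sketch; the model id and the
# processor call signature below are assumptions for demonstration.
import librosa
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "nvidia/audio-flamingo-3"  # hypothetical identifier

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Whisper-style encoders typically expect 16 kHz mono input.
audio, sr = librosa.load("clip.wav", sr=16000, mono=True)

prompt = "What instruments are playing, and what is the speaker saying?"
inputs = processor(
    text=prompt, audios=audio, sampling_rate=sr, return_tensors="pt"
).to(model.device)

# Generate a free-form answer conditioned on the audio and the prompt.
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

For the on-demand thinking mode described above, one would expect the prompt (or a dedicated control token) to request chain-of-thought output before the final answer, in line with the AF-Think training data.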