Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

📅 2026-04-12

🤖 AI Summary
This work introduces Audio Flamingo Next (AF-Next), an open-source audio-language model that targets the weaknesses of existing audio-language models on long-form audio and complex reasoning tasks. The model is trained on more than one million hours of diverse audio using a curriculum that spans pre-training, mid-training, and post-training, and it accepts audio inputs of up to 30 minutes. A key contribution is Temporal Audio Chain-of-Thought, a reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in the audio. Across 20 audio understanding and reasoning benchmarks, the model outperforms similarly sized open models by large margins and is competitive with, and sometimes surpasses, much larger open-weight and closed models. It also transfers well to unseen tasks.


📝 Abstract
We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code and methods, we open-source 3 variants of AF-Next, including AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.
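
To make the idea of timestamp-grounded reasoning concrete, here is a minimal Python sketch of how a chain of reasoning steps tied to audio spans might be represented and checked. This is an illustration only: the `ReasoningStep` schema, the helper names, and the `[MM:SS-MM:SS]` rendering are all hypothetical and are not taken from the paper or its released code. The only figure drawn from the abstract is the 30-minute input limit.

```python
from dataclasses import dataclass

# 30-minute input limit stated in the abstract; everything else is assumed.
MAX_AUDIO_SECONDS = 30 * 60


@dataclass(frozen=True)
class ReasoningStep:
    """One intermediate reasoning step tied to a span of audio (hypothetical schema)."""
    start_s: float
    end_s: float
    thought: str


def validate_chain(steps, audio_seconds):
    """Return True if every step points at a non-empty span inside the audio."""
    if not 0 < audio_seconds <= MAX_AUDIO_SECONDS:
        raise ValueError("audio length must be in (0, 1800] seconds")
    for i, step in enumerate(steps):
        if not 0 <= step.start_s < step.end_s <= audio_seconds:
            raise ValueError(f"step {i} has an invalid span")
    return True


def format_timestamp(seconds):
    """Render a time offset in seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_chain(steps):
    """Render each step as '[MM:SS-MM:SS] thought', one step per line."""
    return "\n".join(
        f"[{format_timestamp(s.start_s)}-{format_timestamp(s.end_s)}] {s.thought}"
        for s in steps
    )


if __name__ == "__main__":
    chain = [
        ReasoningStep(12.0, 45.0, "A door slams, then two speakers start talking."),
        ReasoningStep(905.0, 930.5, "Background music fades in under the speech."),
    ]
    validate_chain(chain, audio_seconds=1200.0)
    print(render_chain(chain))
```

Keeping each step's span explicit means a reader, or an automated check, can jump to the cited portion of the audio and verify the claim, which is the interpretability benefit the abstract attributes to temporal grounding.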
Problem

Research questions and friction points this paper is trying to address.

audio-language models
audio understanding
temporal reasoning
long audio
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Audio Chain-of-Thought
long-audio understanding
audio-language model
scalable dataset curation
fine-grained temporal alignment
Authors

Sreyan Ghosh
Ph.D. in CS at University of Maryland, College Park
AI, Machine Learning, NLP, Speech Recognition

Arushi Goel
Research Scientist, NVIDIA
Computer Vision, Machine Learning, Vision and Language

Kaousheik Jayakumar
University of Maryland, USA

Lasha Koroshinadze
University of Maryland, USA

Nishit Anand
MS CS at University of Maryland, College Park
Machine Learning, Computer Vision, Natural Language Processing, Speech Recognition

Zhifeng Kong
Senior Research Scientist, NVIDIA
Deep Generative Models, Diffusion Models, Audio Foundation Models, Audio LM, Trustworthy ML

Siddharth Gururani
NVIDIA Research
Artificial Intelligence, Music Information Retrieval, Machine Learning, Deep Learning, Text to Speech

Sang-gil Lee
NVIDIA
Deep Generative Model, Audio Synthesis, Language Model

Jaehyeon Kim
NVIDIA
Machine Learning

Aya Aljafari
NVIDIA, USA

Chao-Han Huck Yang
Sr. Research Scientist, NVIDIA Research
Robust Speech Recognition, Language Models, Post-Training, Sequence Modeling

Sungwon Kim
NVIDIA, USA

Ramani Duraiswami
Computer Science and UMIACS, University of Maryland
Scientific Computing, Spatial Audio, Machine Learning, Computational Electromagnetics

Dinesh Manocha
Distinguished University Professor, University of Maryland at College Park
computer graphics, geometric modeling, motion planning, virtual reality, robotics

Mohammad Shoeybi
Senior Director of Applied Research at NVIDIA
Large Language Models, NLP, Multi-Modal Models, Generative AI

Bryan Catanzaro
NVIDIA
Parallel Computing, Machine Learning

Ming-Yu Liu
Vice President of Research at NVIDIA, IEEE Fellow
Computer Vision, Machine Learning

Wei Ping
Distinguished Research Scientist, NVIDIA
machine learning, large language models, speech synthesis, reinforcement learning