Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing audio-language models on long-form audio and complex reasoning tasks by introducing Audio Flamingo Next (AF-Next), a new open audio-language model. The model is trained on newly curated, large-scale audio data totaling over 1 million hours, using a curriculum-based strategy that spans pre-training, mid-training, and post-training stages, and it accepts audio inputs up to 30 minutes long. A key contribution is Temporal Audio Chain-of-Thought, a reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in the audio. Across 20 audio understanding and reasoning benchmarks, AF-Next outperforms similarly sized open models by large margins and is competitive with, and sometimes surpasses, much larger open-weight and closed models. It also transfers well to unseen tasks.

📝 Abstract

We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code and methods, we open-source 3 variants of AF-Next, including AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.
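
To make the idea of timestamp-grounded reasoning concrete, the sketch below shows one way such a trace could be represented and sanity-checked. It is an illustration only: the abstract does not specify the model's output format, so the `GroundedStep` schema, the helper names, and the `[MM:SS-MM:SS]` rendering are all assumptions made here, not the authors' design. The only value taken from the abstract is the 30-minute input limit.

```python
from dataclasses import dataclass

# The 30-minute limit comes from the abstract; everything else is hypothetical.
MAX_AUDIO_SECONDS = 30 * 60


@dataclass(frozen=True)
class GroundedStep:
    """One reasoning step tied to an audio span (hypothetical schema)."""
    start_s: float
    end_s: float
    thought: str


def validate_chain(steps, audio_len_s):
    """Check that every step points at a non-empty span inside the audio."""
    if not 0 < audio_len_s <= MAX_AUDIO_SECONDS:
        raise ValueError("audio length outside the supported range")
    for step in steps:
        if not 0 <= step.start_s < step.end_s <= audio_len_s:
            raise ValueError(f"step outside audio: {step}")
    return True


def format_timestamp(seconds):
    """Render seconds as MM:SS, truncating fractional seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_chain(steps):
    """Produce a readable trace, one line per grounded step."""
    return "\n".join(
        f"[{format_timestamp(s.start_s)}-{format_timestamp(s.end_s)}] {s.thought}"
        for s in steps
    )
```

Calling `validate_chain(steps, audio_len_s)` before `render_chain(steps)` rejects steps whose spans are empty, reversed, or fall outside the clip, which is the kind of check that timestamp grounding makes possible in the first place.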

Problem

Research questions and friction points this paper is trying to address.

audio-language models

audio understanding

temporal reasoning

long audio

generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Audio Chain-of-Thought

long-audio understanding

audio-language model