🤖 AI Summary
Current audio-language models exhibit three fundamental limitations in music understanding: (1) inability to effectively model music’s dynamic, hierarchical, and information-dense nature; (2) scarcity of high-quality, semantically rich annotations; and (3) poor generalization—restricted to shallow description and factoid QA. To address these, we introduce MF-Skills, a large-scale, multidimensional dataset for deep music understanding, covering harmony, structure, timbre, lyrics, and cultural context. We further propose MF-Think, a music-theory-grounded chain-of-thought dataset. Our method employs an enhanced Audio Flamingo 3 architecture, integrating multi-stage annotation, instruction tuning, chain-of-thought cold-start initialization, and GRPO-based reinforcement learning with custom music-specific rewards. Evaluated across 10+ music understanding and reasoning benchmarks, our approach achieves state-of-the-art performance, significantly improving fine-grained musical cognition and cross-cultural generalization—establishing a new paradigm for general-purpose, musically intelligent audio-language models.
📝 Abstract
We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and generalizing poorly across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.
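The abstract mentions GRPO-based reinforcement learning with custom rewards. As a minimal sketch of the general GRPO idea (not the paper's actual implementation or reward functions), GRPO samples a group of responses per prompt, scores each with a reward, and normalizes rewards within the group to obtain advantages. The reward combination below is purely hypothetical, for illustration:

```python
# Sketch of GRPO-style group-relative advantages (general technique,
# not the authors' code). The reward terms are hypothetical placeholders.

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of per-response rewards: A_i = (r_i - mean) / std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def toy_music_reward(answer_correct, cot_well_formed):
    """Hypothetical composite reward: answer correctness plus a
    small bonus for a well-formed chain-of-thought trace."""
    return (1.0 if answer_correct else 0.0) + (0.2 if cot_well_formed else 0.0)

# Example: a group of 4 sampled responses to one music-QA prompt.
rewards = [toy_music_reward(c, f) for c, f in
           [(True, True), (True, False), (False, True), (False, False)]]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct responses get positive advantage, incorrect negative
```

Responses scoring above the group mean receive positive advantages and are reinforced; those below the mean are penalized, without requiring a learned value model.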