🤖 AI Summary
Current audio-language models lack explicit chain-of-thought (CoT) reasoning capabilities, limiting their performance on advanced auditory understanding tasks such as sound commonsense reasoning and fine-grained discrimination. To address this, we introduce CoT reasoning into audio-language modeling. We propose AF-Reasoning-Eval, the first dedicated benchmark for sound reasoning, and AF-CoT-Train, the first large-scale, automatically constructed sound-reasoning training dataset. We design an end-to-end data transformation pipeline that converts raw audio question-answering and classification data into structured reasoning chains, and perform CoT fine-tuning on Audio Flamingo-based models. Experiments demonstrate significant performance gains across multiple sound reasoning benchmarks, validating the effectiveness of explicit reasoning chains for acoustic cognitive modeling. This work advances audio understanding from perceptual recognition toward systematic, interpretable reasoning.
📝 Abstract
Chain-of-thought reasoning has yielded significant improvements in large language models and vision-language models, yet its potential for audio-language models remains largely unexplored. In this technical report, we take a preliminary step towards closing this gap. For better assessment of sound reasoning, we propose AF-Reasoning-Eval, a benchmark targeting common-sense reasoning and the ability to discriminate among closely related choices. To prepare a training corpus for sound reasoning abilities, we propose automatic pipelines that transform existing audio question-answering and classification data into explicit reasoning chains, yielding AF-CoT-Train with 1.24M samples. We study the effect of fine-tuning the Audio Flamingo series on AF-CoT-Train and observe considerable improvements on several reasoning benchmarks, validating the effectiveness of chain-of-thought fine-tuning for advanced sound understanding.