🤖 AI Summary
Current audio-language models lack explicit chain-of-thought (CoT) reasoning capabilities, limiting their performance on advanced auditory understanding tasks such as sound commonsense reasoning and fine-grained discrimination. To address this, we introduce CoT reasoning into audio-language modeling. We propose AF-Reasoning-Eval, the first dedicated benchmark for sound reasoning, and AF-CoT-Train, the first large-scale, automatically constructed sound-reasoning training dataset. We design an end-to-end data transformation pipeline that converts raw audio question-answering and classification data into structured reasoning chains, and perform CoT fine-tuning on Audio Flamingo-based models. Experiments demonstrate significant performance gains across multiple sound reasoning benchmarks, validating the effectiveness of explicit reasoning chains for acoustic cognitive modeling. This work advances audio understanding from perceptual recognition toward systematic, interpretable reasoning.
📝 Abstract
Chain-of-thought reasoning has yielded significant improvements in large language models and vision-language models, yet its potential for audio-language models remains largely unexplored. In this technical report, we take a preliminary step towards closing this gap. For better assessment of sound reasoning, we propose AF-Reasoning-Eval, a benchmark targeting common-sense reasoning and the ability to discriminate among closely related choices. To prepare a training corpus for sound reasoning abilities, we propose automatic pipelines that transform existing audio question-answering and classification data into explicit reasoning chains, yielding AF-CoT-Train with 1.24M samples. We study the effect of fine-tuning the Audio Flamingo series on AF-CoT-Train and observe considerable improvements on several reasoning benchmarks, validating the effectiveness of chain-of-thought fine-tuning for advanced sound understanding.