Mellow: a small audio language model for reasoning

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited reasoning capability of small audio-language models (ALMs) on edge devices, this paper introduces Mellow, a lightweight ALM designed for reasoning. Methodologically, Mellow combines: (i) an inference-optimized architecture integrating Whisper's audio encoder, a compact text decoder, and a learnable projection layer; and (ii) the ReasonAQA dataset, 70% of which consists of multi-dimensional audio reasoning questions generated by large language models, used for synthetic data augmentation and instruction fine-tuning. Experimentally, Mellow scores 52.11 on the MMAU benchmark, matching the 8B-parameter Qwen2-Audio (52.5) while using roughly 1/50 the parameters and 1/60 the training audio hours. Mellow also consistently outperforms existing small ALMs on out-of-distribution reasoning tasks, demonstrating strong generalization and efficiency for edge deployment.
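The summary above describes an architecture in which a frozen audio encoder's features pass through a learnable projection layer into a compact text decoder. Below is a minimal numpy sketch of that projection step; all dimensions and the linear form of the projection are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only; the paper's actual
# sizes are not given in this summary.
AUDIO_DIM = 512   # audio-encoder output dimension (assumed)
TEXT_DIM = 768    # text-decoder embedding dimension (assumed)
N_FRAMES = 100    # number of encoded audio frames (assumed)

def project_audio(audio_feats: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map audio-encoder features into the text decoder's embedding space."""
    return audio_feats @ W + b

# Stand-ins for the frozen audio encoder's output and learned projection weights.
audio_feats = rng.standard_normal((N_FRAMES, AUDIO_DIM))
W = rng.standard_normal((AUDIO_DIM, TEXT_DIM)) * 0.02
b = np.zeros(TEXT_DIM)

audio_tokens = project_audio(audio_feats, W, b)

# The projected frames are prepended to the text token embeddings so the
# decoder can attend to audio context while generating an answer.
text_embeds = rng.standard_normal((20, TEXT_DIM))
decoder_input = np.concatenate([audio_tokens, text_embeds], axis=0)
print(decoder_input.shape)  # (120, 768)
```

The ablation studies mentioned in the abstract vary exactly this projection-layer choice, which is why it is the interesting moving part in an otherwise standard encoder-decoder pairing.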

📝 Abstract
Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to SoTA Qwen2 Audio (which scores 52.5) while using 50 times fewer parameters and being trained on 60 times less data (audio hrs). To train Mellow, we introduce ReasonAQA, a dataset designed to enhance audio-grounded reasoning in models. It consists of a mixture of existing datasets (30% of the data) and synthetically generated data (70%). The synthetic dataset is derived from audio captioning datasets, where Large Language Models (LLMs) generate detailed and multiple-choice questions focusing on audio events, objects, acoustic scenes, signal properties, semantics, and listener emotions. To evaluate Mellow's reasoning ability, we benchmark it on a diverse set of tasks, assessing on both in-distribution and out-of-distribution data, including audio understanding, deductive reasoning, and comparative reasoning. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance. Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.
Problem

Research questions and friction points this paper is trying to address.

Reasoning performance in audio-language models typically scales with size; the best results come from models exceeding 8 billion parameters
No prior work enables small ALMs to perform reasoning tasks, despite clear applications on edge devices
Existing datasets provide little audio-grounded reasoning supervision suitable for training small models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mellow: a small ALM that matches much larger models on reasoning (52.11 on MMAU vs. 52.5 for Qwen2-Audio) with 50x fewer parameters
ReasonAQA: a dataset mixing existing sources (30%) with LLM-generated reasoning questions (70%) derived from audio captioning datasets
Extensive ablations on projection-layer choices, synthetic data generation methods, and language-model pretraining
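Per the abstract, the synthetic portion of ReasonAQA is produced by prompting LLMs to turn audio captions into detailed and multiple-choice questions covering events, scenes, signal properties, semantics, and listener emotions. A minimal sketch of such prompt construction follows; the prompt wording and aspect list are assumptions for illustration, not the paper's actual prompts.

```python
# Aspects named in the abstract; the prompt template itself is hypothetical.
ASPECTS = ["audio events", "objects", "acoustic scenes",
           "signal properties", "semantics", "listener emotions"]

def build_mcq_prompt(caption: str, aspect: str) -> str:
    """Ask an LLM to turn an audio caption into a multiple-choice reasoning question."""
    return (
        f'Audio caption: "{caption}"\n'
        f"Write one multiple-choice question about the {aspect} of this audio, "
        "with four options (A-D), and mark the correct answer. "
        "The question must be answerable from listening alone."
    )

prompt = build_mcq_prompt("A dog barks while rain falls on a tin roof.", ASPECTS[0])
print(prompt)
```

The resulting prompts would be sent to an LLM (not shown here), and the generated question-answer pairs paired back with the source audio to form instruction-tuning examples.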