🤖 AI Summary
This work investigates whether the audio modality is necessary when fine-tuning audio-capable large language models (LLMs). Building on the Qwen2.5-Omni multimodal architecture, we apply the GRPO reinforcement learning algorithm to fine-tune the model exclusively on text-only audio question-answering data, without access to the raw audio inputs. Our experiments demonstrate that text-only fine-tuning substantially improves performance on audio understanding tasks; ablation studies with controlled modality inputs confirm that the GRPO gains stem primarily from enhanced textual reasoning rather than explicit audio feature modeling. On the MMAU benchmark (both the Test-mini and Test-full splits), our approach establishes new state-of-the-art results across the sounds, music, speech, and overall average accuracy categories. To our knowledge, this is the first study to empirically verify that high-quality textual instruction data alone can elicit implicit audio comprehension capabilities in multimodal LLMs, offering a novel, low-resource paradigm for training audio AI.
📄 Abstract
We propose Omni-R1, which fine-tunes a recent multimodal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This yields new state-of-the-art performance on the recent MMAU benchmark: Omni-R1 achieves the highest accuracies on the sounds, music, speech, and overall average categories, on both the Test-mini and Test-full splits. To understand the improvement, we evaluated models both with and without audio and found that much of the performance gain from GRPO can be attributed to better text-based reasoning. We also made the surprising discovery that fine-tuning on a text-only dataset, without any audio, was effective at improving audio-based performance.
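A distinguishing feature of GRPO, the reinforcement learning method used above, is that it needs no learned value critic: it samples a group of completions per prompt and standardizes each completion's reward within that group to obtain its advantage. The following is a minimal sketch of that group-relative advantage step only; the function name and the 0/1 answer-matching reward are illustrative assumptions, not the paper's actual implementation.

```python
import statistics

def group_relative_advantages(rewards):
    """Standardize each sampled completion's reward within its group.

    For a group of G completions sampled from the same prompt, GRPO
    uses (reward - group mean) / group std as the advantage, in place
    of a critic's value estimate.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled answers to one audio QA prompt, with an assumed
# reward of 1.0 when the chosen multiple-choice option matches the
# reference answer and 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
# Correct completions receive positive advantage, incorrect negative,
# and the advantages sum to zero within the group.
```

These advantages then weight a clipped policy-gradient update, as in PPO, but computed per group rather than from a value network.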