Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

πŸ“… 2025-05-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work investigates how necessary the audio modality is when fine-tuning audio-capable large language models (LLMs). Building on the Qwen2.5-Omni multimodal architecture, the authors fine-tune the model with the GRPO reinforcement learning algorithm on audio question-answering data, establishing new state-of-the-art results on the MMAU benchmark (both the Test-mini and Test-full splits) across the sound, music, speech, and overall average categories. Ablation studies with controlled modality inputs show that the GRPO gains stem primarily from improved text-based reasoning rather than explicit audio feature modeling; surprisingly, fine-tuning on a text-only dataset, without any raw audio inputs, also substantially improves audio understanding performance. These results suggest that high-quality textual data alone can elicit implicit audio comprehension in multimodal LLMs, pointing to a low-resource paradigm for audio AI training.

πŸ“ Abstract
We propose Omni-R1, which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This leads to new state-of-the-art performance on the recent MMAU benchmark. Omni-R1 achieves the highest accuracies on the sounds, music, speech, and overall average categories, on both the Test-mini and Test-full splits. To understand the performance improvement, we tested models both with and without audio and found that much of the improvement from GRPO could be attributed to better text-based reasoning. We also made the surprising discovery that fine-tuning without audio, on a text-only dataset, was effective at improving audio-based performance.
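The core of the method is GRPO, which scores a group of sampled responses per prompt and normalizes rewards within the group, avoiding a learned value network. A minimal sketch of that group-relative advantage computation is below; the function name, reward scheme, and `eps` stabilizer are illustrative assumptions, not taken from the Omni-R1 codebase.

```python
# Sketch of GRPO's group-relative advantage step (illustrative only).
from statistics import mean, stdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """For one prompt, sample a group of responses, score each with a scalar
    reward (e.g. 1.0 if the QA answer is correct, else 0.0), and normalize
    rewards within the group: advantage = (r - group mean) / (group std + eps).
    These advantages then weight the policy-gradient update; no critic is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one question, two of them correct.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, so the update pushes the policy toward the better answers in each group regardless of the absolute reward scale.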
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning audio LLMs without using audio data
Improving audio question answering via text-based reasoning
Achieving state-of-the-art performance on MMAU benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses GRPO reinforcement learning for fine-tuning
Achieves SOTA on the MMAU benchmark
Text-only fine-tuning boosts audio performance