🤖 AI Summary
This work investigates whether the audio modality is necessary when fine-tuning audio-capable large language models (LLMs). Building on the Qwen2.5-Omni multimodal architecture, we apply the GRPO reinforcement learning algorithm to fine-tune the model exclusively on text-only audio question-answering data, without access to the raw audio inputs. Our experiments demonstrate that text-only fine-tuning substantially improves performance on audio understanding tasks; ablation studies with controlled modality inputs confirm that the GRPO gains stem primarily from enhanced textual reasoning rather than explicit audio feature modeling. On the MMAU benchmark (both the Test-mini and Test-full splits), our approach establishes new state-of-the-art results across the sounds, music, speech, and overall average accuracy categories. To our knowledge, this is the first study to empirically verify that high-quality textual instruction data alone can elicit implicit audio comprehension capabilities in multimodal LLMs, offering a novel, low-resource paradigm for training audio AI.
📄 Abstract
We propose Omni-R1, which fine-tunes a recent multimodal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This yields new state-of-the-art performance on the recent MMAU benchmark: Omni-R1 achieves the highest accuracies on the sounds, music, speech, and overall average categories, on both the Test-mini and Test-full splits. To understand the improvement, we evaluated models both with and without audio and found that much of the performance gain from GRPO can be attributed to better text-based reasoning. We also made the surprising discovery that fine-tuning on a text-only dataset, without any audio, was effective at improving audio-based performance.
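A distinguishing feature of GRPO, the reinforcement learning method used above, is that it needs no learned value critic: it samples a group of completions per prompt and standardizes each completion's reward within that group to obtain its advantage. The following is a minimal sketch of that group-relative advantage step only; the function name and the 0/1 answer-matching reward are illustrative assumptions, not the paper's actual implementation.

```python
import statistics

def group_relative_advantages(rewards):
    """Standardize each sampled completion's reward within its group.

    For a group of G completions sampled from the same prompt, GRPO
    uses (reward - group mean) / group std as the advantage, in place
    of a critic's value estimate.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled answers to one audio QA prompt, with an assumed
# reward of 1.0 when the chosen multiple-choice option matches the
# reference answer and 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
# Correct completions receive positive advantage, incorrect negative,
# and the advantages sum to zero within the group.
```

These advantages then weight a clipped policy-gradient update, as in PPO, but computed per group rather than from a value network.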