Step-Audio-R1 Technical Report

📅 2025-11-19
🤖 AI Summary
Can audio-language models perform genuinely acoustics-driven deep reasoning? This paper introduces the first audio-language model capable of cross-type reasoning over speech, environmental sounds, and music. We propose Modality-Grounded Reasoning Distillation (MGRD), a novel framework that enforces explicit reliance on raw acoustic representations during inference—via chain-of-thought training and fine-grained audio feature alignment—thereby mitigating hallucination. MGRD is the first method to enable interpretable and verifiable reasoning chains in the audio modality. Experiments demonstrate that our model surpasses Gemini 2.5 Pro and matches Gemini 3 Pro across multiple audio understanding and reasoning benchmarks, validating the strong generalizability and transferability of acoustics-grounded cross-modal reasoning.

📝 Abstract
Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question: can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
Problem

Research questions and friction points this paper is trying to address.

Audio language models perform poorly with extended reasoning chains
Unlocking genuine reasoning capabilities in the audio domain
Grounding audio reasoning in acoustic features to prevent hallucinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-Grounded Reasoning Distillation framework for audio
Generates acoustic feature-based reasoning chains
Achieves state-of-the-art audio understanding across multiple domains
Authors
Fei Tian, StepFun-Audio Team
Xiangyu Tony Zhang, StepFun-Audio Team
Yuxin Zhang, StepFun-Audio Team
Haoyang Zhang, Ph.D. student of Computer Science, University of Illinois Urbana-Champaign (Computer Architecture, System Software)
Yuxin Li, StepFun-Audio Team
Daijiao Liu, StepFun-Audio Team
Yayue Deng, Beijing University of Posts and Telecommunications (Speech Synthesis, Speech Processing, LLM, Machine Learning)
Donghang Wu, StepFun-Audio Team
Jun Chen, StepFun-Audio Team
Liang Zhao, StepFun-Audio Team
Chengyuan Yao, Columbia University (Educational Data Science, Transfer Learning, Algorithmic Fairness)
Hexin Liu, Nanyang Technological University (Speech Recognition, Language Identification)
Eng Siong Chng, StepFun-Audio Team
Xuerui Yang, StepFun-Audio Team
Xiangyu Zhang, StepFun-Audio Team
Daxin Jiang, Co-Founder & CEO, StepFun Corporation (Deep Learning, Foundation Models)
Gang Yu, StepFun-Audio Team