Step-Audio 2 Technical Report

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in industrial-grade speech understanding and dialogue systems—including low ASR accuracy, weak modeling of paralinguistic information (e.g., emotion, speaking style), and severe hallucination—by proposing an end-to-end multimodal architecture that integrates discrete audio token generation directly into a large language model. Methodologically, it combines a latent audio encoder, reasoning-centric reinforcement learning, and retrieval-augmented generation (RAG) to enable emotion-aware prosody modeling, timbre-controllable speech synthesis, and invocation of external tools (e.g., web/audio search). Trained end-to-end on massive real-world speech data, the model significantly mitigates hallucination. Experiments demonstrate state-of-the-art performance across multiple audio understanding and spoken dialogue benchmarks—surpassing both open-source and commercial baselines—while achieving high recognition accuracy, expressive response generation, and strong cross-scenario generalization.

📝 Abstract
This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
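The abstract's key mechanism—incorporating discrete audio token generation into language modeling—means the model emits a single stream in which text tokens and audio codec tokens are interleaved. A minimal sketch of how such a stream might be separated for display and synthesis is shown below; the vocabulary sizes and token ranges are invented for illustration and are not Step-Audio 2's actual tokenizer layout.

```python
# Hypothetical sketch: splitting an interleaved output stream into text IDs
# and discrete audio codec codes, as an end-to-end speech LLM might emit.
# TEXT_VOCAB_SIZE and the offset scheme are assumptions, not the real config.
TEXT_VOCAB_SIZE = 50_000               # assumed size of the text vocabulary
AUDIO_TOKEN_OFFSET = TEXT_VOCAB_SIZE   # assumed: audio codes follow text IDs

def split_interleaved(tokens):
    """Separate a mixed token stream into text IDs and audio codec codes."""
    text_ids, audio_codes = [], []
    for tok in tokens:
        if tok < AUDIO_TOKEN_OFFSET:
            text_ids.append(tok)                          # regular text token
        else:
            audio_codes.append(tok - AUDIO_TOKEN_OFFSET)  # map to codec index
    return text_ids, audio_codes

# Example: a response mixing text tokens with audio tokens for synthesis.
text, audio = split_interleaved([101, 52000, 52001, 205, 53000])
# text  -> [101, 205]
# audio -> [2000, 2001, 3000]
```

In a real system the audio codes would be fed to a codec decoder to produce the waveform while the text tokens are detokenized for the transcript.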
Problem

Research questions and friction points this paper is trying to address.

Develops an end-to-end multi-modal model for audio understanding.
Enhances speech conversation with paralinguistic information handling.
Integrates retrieval-augmented generation to reduce hallucination in responses.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent audio encoder with reinforcement learning
Discrete audio tokens for paralinguistic responsiveness
Retrieval-augmented generation with external tools
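The RAG contribution pairs the model with two kinds of external tools: web search for factual grounding and audio search for retrieving a reference voice to switch timbres. A minimal dispatch sketch under assumed tool names and result shapes (these are illustrative, not Step-Audio 2's actual tool API):

```python
# Hypothetical sketch of the tool-dispatch step in a RAG pipeline: the model
# emits a tool call, and the runtime routes it either to web search (text
# evidence to reduce hallucination) or to audio search (a reference clip
# used to switch the synthesis timbre). Names and shapes are assumptions.
def dispatch_tool(call):
    name, query = call["name"], call["query"]
    if name == "web_search":
        # Stub: a real runtime would query a search backend here.
        return {"type": "text", "evidence": f"results for: {query}"}
    if name == "audio_search":
        # Stub: a real runtime would retrieve a matching voice clip here.
        return {"type": "audio", "reference_voice": f"clip matching: {query}"}
    raise ValueError(f"unknown tool: {name}")

result = dispatch_tool({"name": "audio_search", "query": "calm narrator voice"})
# result["type"] -> "audio"
```

The retrieved evidence or reference audio would then be appended to the model's context before it generates the final interleaved text/audio response.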