🤖 AI Summary
To address the high cost of speech data acquisition, limited dynamic controllability, and insufficient cognitive capabilities of open-source models in real-time spoken human–machine collaboration, this paper introduces Step-Audio, a production-ready, in-house-developed, open-source real-time spoken interaction system. Methodologically, the authors construct a 130B-parameter unified multimodal large language model enabling joint speech–text understanding and generation; propose a generative speech data engine coupled with a knowledge-distilled lightweight TTS model; design instruction-driven, fine-grained controllable speech synthesis (supporting dialects, emotions, rapping, and singing); and integrate a cognition-enhanced architecture featuring tool calling and role-playing. Contributions include the released Step-Audio-Chat and Step-Audio-TTS-3B models, fully open-sourced code, and the StepEval-Audio-360 evaluation benchmark. Human evaluations show state-of-the-art instruction-following performance, with an average improvement of 9.3% on benchmarks including LLaMA Question.
📝 Abstract
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as the high cost of voice data collection, weak dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice-cloning framework and yields the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine-grained control system enabling dynamic adjustment across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to handle complex tasks effectively. On our new StepEval-Audio-360 benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.