🤖 AI Summary
To address the high cost of speech data acquisition, limited dynamic controllability, and insufficient cognitive capabilities of open-source models in real-time spoken human–machine collaboration, this paper introduces Step-Audio, a production-ready, in-house-developed, open-source real-time spoken interaction system. Methodologically, the authors construct a 130B-parameter unified multimodal large language model enabling joint speech–text understanding and generation; propose a generative speech data engine coupled with a knowledge-distilled lightweight TTS model; design instruction-driven, fine-grained controllable speech synthesis (supporting dialects, emotions, rapping, and singing); and integrate a cognition-enhanced architecture featuring tool calling and role-playing. Contributions include the released Step-Audio-Chat and Step-Audio-TTS-3B models, fully open-sourced code, and the StepEval-Audio-360 evaluation benchmark. Human evaluations show state-of-the-art instruction-following performance, with an average improvement of 9.3% on benchmarks including LLaMA Question.
📝 Abstract
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as the high cost of voice data collection, weak dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice-cloning framework and yields the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine-grained control system enabling dynamic adjustment across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to handle complex tasks effectively. On our new StepEval-Audio-360 benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.