Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

📅 2025-02-17
📈 Citations: 0 (influential: 0)
🤖 AI Summary
To address the high cost of voice data collection, weak dynamic control, and limited intelligence of current open-source models for real-time speech interaction, this paper introduces Step-Audio, which the authors describe as the first production-ready open-source solution. The work builds a 130B-parameter unified speech-text multimodal model for joint understanding and generation; a generative speech data engine that supports affordable voice cloning and, through distillation, yields a lightweight TTS model; an instruction-driven fine control system covering dialects, emotions, singing, and rap; and an enhanced cognitive architecture with tool calling and role-playing. Released artifacts include the Step-Audio-Chat and Step-Audio-TTS-3B models, the code, and the StepEval-Audio-360 evaluation benchmark. On StepEval-Audio-360, Step-Audio achieves state-of-the-art results in human evaluations, especially for instruction following; on open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement.

📝 Abstract
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and rap; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Problem

Research questions and friction points this paper is trying to address.

Unified speech-text multi-modal model
Affordable voice cloning framework
Instruction-driven dynamic control system
Innovation

Methods, ideas, or system contributions that make the work stand out.

130B-parameter unified speech-text model
Affordable voice cloning framework
Instruction-driven fine control system
Authors

Ailin Huang
Boyong Wu
Bruce Wang
Chao Yan
  Instructor at DBMI, VUMC; CS PhD from Vanderbilt U
  AI for medicine, Synthetic health data, Privacy, Fairness
Chen Hu
  School of Artificial Intelligence and Computer Science, Jiangnan University
  Geometric Deep Learning, Machine Learning
Chengli Feng
Fei Tian
Feiyu Shen
  Shanghai Jiao Tong University
  text-to-speech synthesis
Jingbei Li
  Tsinghua University
Mingrui Chen
  Institute of Automation, Chinese Academy of Sciences
  Computer Vision, Foundation Models
Peng Liu
Ruihang Miao
Wang You
Xi Chen
Xuerui Yang
Yechang Huang
Yuxiang Zhang
Zheng Gong
Zixin Zhang
  Hong Kong University of Science and Technology (GZ)
  Computer Vision
Hongyu Zhou
Jianjian Sun
  Researcher of StepFun
  LLM, Multi-modal
Brian Li
Chengting Feng
Changyi Wan
Hanpeng Hu
  The University of Hong Kong
  Distributed ML, ML Diagnosis and Optimization
Jianchang Wu
Jiangjie Zhen
Ranchen Ming
Song Yuan
  Zhejiang University, CAGE
  Development Economics, International Economics, Political Economy, Economic History
Xuelin Zhang
Yu Zhou
Bingxin Li
Buyun Ma
Hongyuan Wang
Kang An
Wei Ji
Wen Li
Xuan Wen
Xiangwen Kong
Yuankai Ma
Yuanwei Liang
Yun-Fei Mou
Bahtiyar Ahmidi
Bin Wang
Bo Li
Changxing Miao
Chen Xu
Chenrun Wang
Dapeng Shi
Deshan Sun
Dingyuan Hu
Dula Sai
Enle Liu
Guanzhe Huang
Gulin Yan
Heng Wang
Haonan Jia
Haoyang Zhang
  Ph.D. student of Computer Science, University of Illinois Urbana-Champaign
  Computer Architecture, System Software
Jiahao Gong
Junjing Guo
Jiashuai Liu
Jiahong Liu
Jie Feng
Jie Wu
Jiaoren Wu
Jie Yang
Jinguo Wang
Jingyang Zhang
Junzhe Lin
Kaixiang Li
Lei Xia
Li Zhou
Liang Zhao
Longlong Gu
Mei Chen
Menglin Wu
Ming Li
Mingxiao Li
Mingliang Li
  Tsinghua University
  Computer System
Mingyao Liang
Na Wang
Nie Hao
Qiling Wu
Qi-Liang Tan
Ran Sun
Shuai Shuai
Shaoliang Pang
Shiliang Yang
Shuli Gao
Shanshan Yuan
Siqi Liu
Shihong Deng
  Bytedance Technology
  Artificial Intelligence
Shilei Jiang
Sitong Liu
  Duke University
Tiancheng Cao
  Schmidt AI in Science Fellow, CSIE, Nanyang Technological University
  Neuromorphic computing, Edge Intelligence, Internet of Medical Things (IoMT), Translational medicine
Tianyu Wang
Wenjin Deng
Wuxun Xie
Weipeng Ming
Wenqing He
Wen Sun
Xin Han
Xin Huang
Xiaomin Deng
Xiaojia Liu
Xin Wu
Xu Zhao
Yanan Wei
Yanbo Yu
Yang Cao
Yangguang Li
  CUHK
  GenAI, Computer Graphics, Computer Vision
Yangzhen Ma
Yanming Xu
Yaoyu Wang
Yaqiang Shi
Yilei Wang
  Alibaba Cloud
Yizhuang Zhou
Yinmin Zhong
  Peking University
  Machine Learning System, Distributed System
Yang Zhang
Yaoben Wei
Yu Luo
Yuanwei Lu
Yuhe Yin
Yuchu Luo
Yuanhao Ding
Yuting Yan
  Nanjing University
  Edge Intelligence, AI System, Video Analytics System
Yaqi Dai
Yuxiang Yang
Zhe Xie
  Tsinghua University, Shanghai Jiao Tong University
  Anomaly Detection, Time Series, AIOps, LLM
Zheng Ge
  Senior Researcher, StepFun
  Multimodal Models Perception and Reasoning
Zheng Sun
Zhewei Huang
Zhichao Chang
Zhi-Ying Guan
Zidong Yang
Zili Zhang
  Peking University
  Distributed system, Deep learning
Binxing Jiao
Daxin Jiang
  Co-Founder & CEO, StepFun Corporation
  Deep Learning, Foundation Models
H. Shum
Jiansheng Chen
  School of Computer and Communication Engineering, University of Science and Technology Beijing
  Computer Vision, Machine Learning
Jing Li
Shuchang Zhou
  Megvii Inc.
  Artificial Intelligence
Xiangyu Zhang
Xinhao Zhang
  PHD student, Portland State University
  Data Mining, Reinforcement Learning
Yibo Zhu