Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large audio-language models (LALMs) suffer from a fundamental bottleneck: reliance on text-based intermediaries, which impedes direct, natural speech generation in response to audio inputs. Method: We propose a fully end-to-end large model for Audio Query-Audio Answer (AQAA) tasks. Our approach features a dual-codebook audio tokenizer, a joint architecture integrating a 130B language model with a neural vocoder, interleaved text/audio token post-training, and a hybrid optimization strategy combining Direct Preference Optimization (DPO) with model merging. Contribution/Results: Step-Audio-AQAA breaks the conventional "audio → text → speech" pipeline, enabling direct audio-to-audio generation with natural prosody and semantic fidelity. On the StepEval-Audio-360 benchmark, it achieves state-of-the-art performance in speech controllability while maintaining high audio fidelity and semantic coherence, demonstrating both architectural novelty and practical efficacy.
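The direct audio-to-audio flow described above can be pictured as "dual-codebook tokenization → LLM over audio tokens → token-based vocoder", with no intermediate transcript. The following is a minimal, illustrative sketch only: every function and codebook size here is a hypothetical stand-in, since the real Step-Audio-AQAA components (the tokenizer, the 130B backbone, and the vocoder) are not shown in this listing.

```python
def dual_codebook_tokenize(frames):
    """Toy dual-codebook tokenizer: one linguistic and one semantic token
    stream per utterance (codebook sizes 1024/4096 are made-up placeholders)."""
    linguistic = [int(f * 1000) % 1024 for f in frames]
    semantic = [int(f * 1000) % 4096 for f in frames]
    return list(zip(linguistic, semantic))

def aqaa_respond(audio_query, llm, vocoder):
    """Audio in, audio out: no intermediate text transcript."""
    tokens = dual_codebook_tokenize(audio_query)
    response_tokens = llm(tokens)      # backbone LLM operates on audio tokens
    return vocoder(response_tokens)    # token-based vocoder emits a waveform

# Toy stand-ins so the sketch runs end to end.
toy_llm = lambda toks: [(a + 1, b + 1) for a, b in toks]
toy_vocoder = lambda toks: [round(0.001 * (a + b), 3) for a, b in toks]

waveform = aqaa_respond([0.1, 0.2, 0.3], toy_llm, toy_vocoder)
```

The point of the shape, per the summary, is that the vocoder consumes tokens produced by the LLM directly, so prosody and semantics stay in one generation path.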

📝 Abstract
Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of the token-based vocoder in enhancing overall performance for AQAA tasks.
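The interleaved text/audio token output used in post-training can be pictured as a chunked merge of two token streams into one training sequence. This is a hedged sketch: the chunk sizes, token placeholders, and fixed alternation are assumptions for illustration, not the paper's actual recipe.

```python
def interleave(text_tokens, audio_tokens, text_chunk=2, audio_chunk=4):
    """Alternate fixed-size chunks of text and audio tokens into one
    sequence, preserving the original order within each stream."""
    out, t, a = [], 0, 0
    while t < len(text_tokens) or a < len(audio_tokens):
        out.extend(text_tokens[t:t + text_chunk]); t += text_chunk
        out.extend(audio_tokens[a:a + audio_chunk]); a += audio_chunk
    return out

# Placeholder token strings; real systems would use integer token IDs.
seq = interleave(["<t1>", "<t2>", "<t3>"],
                 ["<a1>", "<a2>", "<a3>", "<a4>", "<a5>"])
```

Keeping text tokens adjacent to the audio tokens they describe is what, per the abstract, helps the model hold semantic coherence while generating speech.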
Problem

Research questions and friction points this paper is trying to address.

How to generate natural speech responses directly in audio interactions
How to extract linguistic and semantic features with a dual-codebook audio tokenizer
How to enhance semantic coherence with interleaved text-audio token output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-codebook audio tokenizer for feature extraction
130-billion-parameter LLM with neural vocoder
DPO and model merging for enhanced performance
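The last bullet names two standard techniques, which can be sketched in a few lines: the DPO loss on a single preference pair, and plain weight averaging as one common form of model merging. This is a hedged sketch of the textbook versions; the paper's exact variants and hyperparameters are not given in this listing, and `beta=0.1` is an illustrative default.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def merge_weights(models, coeffs):
    """Linear model merge: per-parameter weighted average of checkpoints
    (each checkpoint represented here as a dict of named parameters)."""
    assert abs(sum(coeffs) - 1.0) < 1e-9, "merge coefficients must sum to 1"
    return {k: sum(c * m[k] for c, m in zip(coeffs, models))
            for k in models[0]}

# Toy usage: two 'checkpoints' with one scalar parameter each.
merged = merge_weights([{"w": 1.0}, {"w": 3.0}], [0.5, 0.5])
loss = dpo_loss(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.2, ref_logp_l=-1.8)
```

DPO pushes the policy's preference margin over a reference model, while merging averages checkpoints; combining the two is the hybrid optimization the summary describes.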
Authors
Ailin Huang
Bingxin Li
Bruce Wang
Boyong Wu
Chao Yan
Chengli Feng
Heng Wang
Hongyu Zhou
Hongyuan Wang
Jingbei Li
Jian‐Yuan Sun
Joanna Wang
Mingrui Chen
Peng Liu
Ruihang Miao
Shilei Jiang
Tian Fei
Wang You
Xi Chen
Xue-Ting Yang
Yechang Huang
Yuxiang Zhang
Zheng Ge
Zheng Gong
Zhewei Huang
Zixin Zhang
Bin Wang
Bo Li
Buyun Ma
Changxin Miao
Changyi Wan
Chen Xu
Dapeng Shi
Dingyuan Hu
Enle Liu
Guanzhe Huang
Gulin Yan
Hanpeng Hu
Haonan Jia
Jiahao Gong
Jiao Wu
Jie Wu
Jie Yang
Junzhe Lin
Kaixiang Li
Lei Xia
Longlong Gu
Ming Li
Nie Hao
Ranchen Ming
Shaoliang Pang
Siqi Liu
Song Yuan
Tiancheng Cao
Wen Li
Wenqing He
Xu Zhao
Xuelin Zhang
Yanbo Yu
Yinmin Zhong
Yu Zhou
Yuanwei Liang
Yuanwei Lu
Yuxiang Yang
Zidong Yang
Zili Zhang
Binxing Jiao
H. Shum
Jiansheng Chen
Jing Li
Xiangyu Zhang
Xinhao Zhang
Yibo Zhu
Daxin Jiang
Shuchang Zhou
Chen Hu