Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

📅 2025-02-24
📈 Citations: 0 (influential: 0)
🤖 AI Summary
This work proposes an end-to-end large speech model framework that unifies speech understanding and generation for low-latency, high-fidelity real-time spoken dialogue and question answering. Methodologically, it (1) introduces a text-guided aligned speech generation mechanism, combining 12.5 Hz multi-codebook discretization with a dedicated audio head to jointly model semantic and acoustic representations, and (2) devises a two-stage pretraining strategy that leverages ASR priors without compromising the linguistic capabilities of the underlying large language model. The key contribution is a speech-native large model in which understanding and generation share a unified architecture, markedly improving dialogue coherence and naturalness. The model achieves state-of-the-art performance across multiple real-time spoken dialogue and question-answering benchmarks, and the model, training code, and data are publicly released.

📝 Abstract
We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz. This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head processes the audio tokens, capturing their unique characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following alignment, the model excels in real-time spoken dialogue and exhibits strong question-answering abilities, demonstrating its versatility and efficiency. Our code, model, and training data are available at https://github.com/baichuan-inc/Baichuan-Audio.
Problem

Research questions and friction points this paper is trying to address.

Unifying audio understanding and generation in a single end-to-end model
Achieving low-latency, real-time speech interaction
Enhancing audio modeling without degrading the LLM's language understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end audio LLM integration
Text-guided aligned speech generation
Two-stage pre-training strategy
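One plausible reading of the two-stage pre-training strategy (the page states only that it "maintains language understanding while enhancing audio modeling") is a freeze-then-unfreeze schedule: first train the new audio-specific modules with the LLM backbone frozen, then train everything jointly. The stage split, module names, and freezing policy below are assumptions for illustration, not the paper's documented recipe.

```python
# Illustrative sketch (assumed, not from the paper): a freeze-then-unfreeze
# two-stage pretraining schedule. Module names are hypothetical.

def trainable_modules(stage: int) -> set[str]:
    """Return which parameter groups are updated in each pretraining stage."""
    audio_modules = {"audio_embed", "audio_head"}  # new speech-specific parts
    llm_modules = {"llm_backbone"}                 # pretrained text LLM
    if stage == 1:
        # Stage 1: train only the audio modules; the frozen LLM backbone
        # cannot drift, so its language understanding is preserved.
        return audio_modules
    if stage == 2:
        # Stage 2: unfreeze everything for joint audio-text modeling.
        return audio_modules | llm_modules
    raise ValueError(f"unknown stage: {stage}")
```

Under this reading, the LLM's linguistic capability is protected in stage 1 by construction, and stage 2 adapts the whole model to interleaved audio-text data.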
👥 Authors
Tianpeng Li, Baichuan Inc.
Jun Liu, Baichuan Inc.
Tao Zhang, Baichuan Inc.
Yuanbo Fang, Baichuan Inc.
Da Pan, Baichuan Inc.
Mingrui Wang, Baichuan Inc.
Zheng Liang, Baichuan Inc.
Zehuan Li, Baichuan Inc.
Mingan Lin, Baichuan Inc.
Guosheng Dong, Baichuan Inc.
Jianhua Xu, University of Electronic Science and Technology of China
Haoze Sun, Tsinghua University
Zenan Zhou, Baichuan Inc.
Weipeng Chen, Baichuan Inc.