🤖 AI Summary
This work proposes an end-to-end audio large language model framework that unifies speech understanding and generation for low-latency, high-fidelity real-time spoken dialogue and question answering. Methodologically: (1) it introduces a text-guided aligned speech generation mechanism, combining 12.5 Hz multi-codebook discretization with a dedicated audio head to jointly model semantic and acoustic representations; (2) it devises a two-stage pre-training strategy that leverages ASR priors without compromising the linguistic capabilities of the underlying large language model. The key contribution is a fully speech-native large model in which understanding and generation share a unified architecture, markedly improving dialogue coherence and naturalness; the model performs strongly across real-time spoken dialogue and question-answering benchmarks. The model, training code, and data are publicly released.
📝 Abstract
We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz, so that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head processes the audio tokens, capturing their distinct characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. After alignment, the model excels at real-time spoken dialogue and exhibits strong question-answering abilities, demonstrating its versatility and efficiency. Our code, model, and training data are available at https://github.com/baichuan-inc/Baichuan-Audio.
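To make the token-rate implications of the 12.5 Hz multi-codebook design concrete, here is a small arithmetic sketch. The 12.5 Hz frame rate comes from the abstract; the codebook count and utterance length below are illustrative assumptions, not values stated here.

```python
import math

# Token-budget arithmetic for a multi-codebook speech tokenizer.
# The 12.5 Hz frame rate is stated in the abstract; the number of
# codebooks (8) is an illustrative assumption, not from the paper.
FRAME_RATE_HZ = 12.5   # speech-token frames per second
NUM_CODEBOOKS = 8      # assumed codebook depth per frame

def speech_tokens_per_second(frame_rate=FRAME_RATE_HZ, codebooks=NUM_CODEBOOKS):
    """Each frame emits one discrete token per codebook."""
    return frame_rate * codebooks

def tokens_for_utterance(duration_s, frame_rate=FRAME_RATE_HZ, codebooks=NUM_CODEBOOKS):
    """Total discrete speech tokens for an utterance of the given length."""
    return math.ceil(duration_s * frame_rate) * codebooks

print(speech_tokens_per_second())   # 100.0 tokens/s under these assumptions
print(tokens_for_utterance(4.0))    # a 4-second reply -> 400 tokens
```

The low 12.5 Hz frame rate keeps the speech-token sequence short relative to the audio waveform, which is what makes real-time interleaved text-and-speech generation tractable for an LLM backbone.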