🤖 AI Summary
This work proposes an end-to-end audio large language model framework that unifies speech understanding and generation for low-latency, high-fidelity real-time spoken dialogue and question answering. Methodologically: (1) it introduces a text-guided aligned speech generation mechanism, combining 12.5 Hz multi-codebook discretization with a dedicated audio head to jointly model semantic and acoustic representations; (2) it devises a two-stage pre-training strategy that leverages ASR priors without compromising the linguistic capabilities of the underlying large language model. The key contribution is a fully speech-native large model in which understanding and generation share a unified architecture, markedly improving dialogue coherence and naturalness; the model performs strongly across real-time spoken dialogue and question-answering benchmarks. The model, training code, and data are publicly released.
📝 Abstract
We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz, so that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head processes the audio tokens, capturing their distinct characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. After alignment, the model excels at real-time spoken dialogue and exhibits strong question-answering abilities, demonstrating its versatility and efficiency. Our code, model, and training data are available at https://github.com/baichuan-inc/Baichuan-Audio.
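To make the token-rate implications of the 12.5 Hz multi-codebook design concrete, here is a small arithmetic sketch. The 12.5 Hz frame rate comes from the abstract; the codebook count and utterance length below are illustrative assumptions, not values stated here.

```python
import math

# Token-budget arithmetic for a multi-codebook speech tokenizer.
# The 12.5 Hz frame rate is stated in the abstract; the number of
# codebooks (8) is an illustrative assumption, not from the paper.
FRAME_RATE_HZ = 12.5   # speech-token frames per second
NUM_CODEBOOKS = 8      # assumed codebook depth per frame

def speech_tokens_per_second(frame_rate=FRAME_RATE_HZ, codebooks=NUM_CODEBOOKS):
    """Each frame emits one discrete token per codebook."""
    return frame_rate * codebooks

def tokens_for_utterance(duration_s, frame_rate=FRAME_RATE_HZ, codebooks=NUM_CODEBOOKS):
    """Total discrete speech tokens for an utterance of the given length."""
    return math.ceil(duration_s * frame_rate) * codebooks

print(speech_tokens_per_second())   # 100.0 tokens/s under these assumptions
print(tokens_for_utterance(4.0))    # a 4-second reply -> 400 tokens
```

The low 12.5 Hz frame rate keeps the speech-token sequence short relative to the audio waveform, which is what makes real-time interleaved text-and-speech generation tractable for an LLM backbone.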