VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high initial-token latency in streaming speech generation—a critical bottleneck for end-to-end deployment—this paper introduces the first end-to-end large speech-language model capable of “audio output upon first forward pass.” Methodologically: (1) we design a lightweight Multi-Cross-Modal Token Prediction (MCTP) module to enable fine-grained interleaved generation of audio and text tokens; (2) we propose a four-stage progressive training strategy integrating streaming autoregressive modeling, cross-modal prediction, model compression, and joint speech–language optimization. Evaluated at the 7B parameter scale, our model achieves 3–5× faster inference speed and significantly outperforms comparably sized open-source models across ASR, TTS, and Speech Question Answering (SQA) benchmarks. This work establishes a scalable new paradigm for low-latency large speech-language models.

📝 Abstract
With the growing demand for natural human-computer interaction, speech-based systems have received increasing attention, as speech is one of the most common forms of daily communication. However, existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates inference but also significantly reduces the latency for generating the first audio token in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3-5x at the 7B parameter scale but also significantly outperforms open-source models of similar size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
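The MCTP idea in the abstract can be illustrated with a toy decoding loop: one backbone forward pass produces a hidden state, a text head emits one text token, and several lightweight heads each predict one future audio token from that same hidden state, so audio tokens appear in the stream after a single pass instead of one pass per token. This is a minimal sketch, not the authors' implementation; all names (`toy_backbone`, `mctp_head`) and the value `K = 4` are illustrative assumptions.

```python
K = 4  # audio tokens emitted per forward pass (hypothetical value)

def toy_backbone(context):
    """Stand-in for the LLM forward pass: returns a fake hidden state."""
    return sum(context) % 97  # deterministic toy "hidden state"

def text_head(hidden):
    """Stand-in for the ordinary next-token head."""
    return ("text", hidden % 10)

def mctp_head(hidden, i):
    """i-th lightweight head predicting the i-th future audio token."""
    return ("audio", (hidden + i) % 10)

def generate(prompt, n_passes):
    context = list(prompt)
    stream = []
    for _ in range(n_passes):
        hidden = toy_backbone(context)  # ONE backbone pass ...
        text_tok = text_head(hidden)
        audio_toks = [mctp_head(hidden, i) for i in range(K)]
        # ... yields one text token interleaved with K audio tokens,
        # so the first audio token arrives within the first pass.
        stream.append(text_tok)
        stream.extend(audio_toks)
        context.extend(tok for _, tok in [text_tok] + audio_toks)
    return stream

stream = generate([1, 2, 3], n_passes=2)
first_audio = next(i for i, (kind, _) in enumerate(stream) if kind == "audio")
```

With a conventional one-token-per-pass decoder, the first audio token would only appear after the whole text prefix had been decoded; here `first_audio` is position 1, i.e. inside the output of the very first forward pass, which is the latency property the paper targets.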
Problem

Research questions and friction points this paper is trying to address.

Reduces high latency in first audio token generation
Enables real-time conversational capabilities with minimal latency
Improves efficiency in cross-modal audio-text token generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight MCTP module for multi-token generation
Four-stage training for speed without quality loss
First-pass audio output enables real-time conversation