VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high initial-token latency in streaming speech generation—a critical bottleneck for end-to-end deployment—this paper introduces the first end-to-end large speech-language model capable of “audio output upon first forward pass.” Methodologically: (1) we design a lightweight Multi-Cross-Modal Token Prediction (MCTP) module to enable fine-grained interleaved generation of audio and text tokens; (2) we propose a four-stage progressive training strategy integrating streaming autoregressive modeling, cross-modal prediction, model compression, and joint speech–language optimization. Evaluated at the 7B parameter scale, our model achieves 3–5× faster inference speed and significantly outperforms comparably sized open-source models across ASR, TTS, and Speech Question Answering (SQA) benchmarks. This work establishes a scalable new paradigm for low-latency large speech-language models.

📝 Abstract
With the growing demand for natural human-computer interaction, speech-based systems have received increasing attention, as speech is one of the most common forms of daily communication. However, existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates inference but also significantly reduces the latency for generating the first audio token in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3-5x at the 7B parameter scale but also significantly outperforms open-source models of similar size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
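The MCTP idea in the abstract can be illustrated with a toy decoding loop: one backbone forward pass produces a hidden state, a text head emits one text token, and several lightweight heads each predict one future audio token from that same hidden state, so audio tokens appear in the stream after a single pass instead of one pass per token. This is a minimal sketch, not the authors' implementation; all names (`toy_backbone`, `mctp_head`) and the value `K = 4` are illustrative assumptions.

```python
K = 4  # audio tokens emitted per forward pass (hypothetical value)

def toy_backbone(context):
    """Stand-in for the LLM forward pass: returns a fake hidden state."""
    return sum(context) % 97  # deterministic toy "hidden state"

def text_head(hidden):
    """Stand-in for the ordinary next-token head."""
    return ("text", hidden % 10)

def mctp_head(hidden, i):
    """i-th lightweight head predicting the i-th future audio token."""
    return ("audio", (hidden + i) % 10)

def generate(prompt, n_passes):
    context = list(prompt)
    stream = []
    for _ in range(n_passes):
        hidden = toy_backbone(context)  # ONE backbone pass ...
        text_tok = text_head(hidden)
        audio_toks = [mctp_head(hidden, i) for i in range(K)]
        # ... yields one text token interleaved with K audio tokens,
        # so the first audio token arrives within the first pass.
        stream.append(text_tok)
        stream.extend(audio_toks)
        context.extend(tok for _, tok in [text_tok] + audio_toks)
    return stream

stream = generate([1, 2, 3], n_passes=2)
first_audio = next(i for i, (kind, _) in enumerate(stream) if kind == "audio")
```

With a conventional one-token-per-pass decoder, the first audio token would only appear after the whole text prefix had been decoded; here `first_audio` is position 1, i.e. inside the output of the very first forward pass, which is the latency property the paper targets.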
Problem

Research questions and friction points this paper is trying to address.

Reduces high latency in first audio token generation
Enables real-time conversational capabilities with minimal latency
Improves efficiency in cross-modal audio-text token generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight MCTP module for multi-token generation
Four-stage training for speed without quality loss
First-pass audio output enables real-time conversation