VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

📅 2025-04-05

🤖 AI Summary

To address high generation latency and the trade-off between audio fidelity and semantic quality in speech large language models (LLMs) during real-time interaction, this paper introduces the VocalNet-1B/8B family and an integrated training-inference framework. Its core methodological contribution, presented as the first of its kind for speech LLMs, is multi-token prediction (MTP): instead of conventional single-step next-token prediction (NTP), the model predicts several tokens per step while jointly modeling speech and text representations, with the aim of improving generation speed, audio quality, and semantic accuracy together. The framework is model-agnostic and scalable, combining speech-text joint representation learning with efficient inference optimization. Experiments show that VocalNet outperforms mainstream omni-modal LLMs while using significantly less training data, and that it substantially outperforms existing open-source speech LLMs. To support reproducibility and community advancement, the authors plan to open-source all models, code, datasets, and the training framework.

📝 Abstract

Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We propose VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework for real-time voice interaction. Departing from the conventional next-token prediction (NTP), we introduce multi-token prediction (MTP), a novel approach optimized for speech LLMs that simultaneously improves generation speed and quality. Experiments show that VocalNet outperforms mainstream Omni LLMs despite using significantly less training data, while also surpassing existing open-source speech LLMs by a substantial margin. To support reproducibility and community advancement, we will open-source all model weights, inference code, training data, and framework implementations upon publication.
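The speed argument for MTP over NTP can be illustrated with a toy decoding loop. This is only a sketch of the general idea, not VocalNet's architecture, which this page does not describe: `toy_predict` and `generate` are hypothetical names, and the "model" simply continues a counting sequence. The point is that emitting up to `k` tokens per forward pass cuts the number of passes to roughly `num_tokens / k`.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a speech decoder (assumption, not from the paper):
# given the tokens generated so far, propose the next `k` tokens in one pass.
PredictFn = Callable[[List[int], int], List[int]]


def toy_predict(prefix: List[int], k: int) -> List[int]:
    """Toy predictor that continues a simple counting sequence."""
    start = prefix[-1] + 1 if prefix else 0
    return [start + i for i in range(k)]


def generate(predict: PredictFn, num_tokens: int, k: int) -> Tuple[List[int], int]:
    """Generate `num_tokens` tokens, emitting up to `k` per forward pass.

    k = 1 corresponds to next-token prediction (NTP);
    k > 1 corresponds to multi-token prediction (MTP).
    Returns the generated tokens and the number of forward passes used.
    """
    tokens: List[int] = []
    passes = 0
    while len(tokens) < num_tokens:
        remaining = num_tokens - len(tokens)
        # Never request more tokens than are still needed.
        tokens.extend(predict(tokens, min(k, remaining)))
        passes += 1
    return tokens, passes


ntp_tokens, ntp_passes = generate(toy_predict, num_tokens=10, k=1)
mtp_tokens, mtp_passes = generate(toy_predict, num_tokens=10, k=4)
print(ntp_passes, mtp_passes)
```

In a real model the tokens predicted further ahead are harder to get right, so any speed gain depends on how the extra predictions are trained; the toy predictor here ignores that entirely.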

Problem

Research questions and friction points this paper is trying to address.

Enhancing speech generation speed and quality

Developing scalable speech LLMs for real-time interaction

Achieving strong performance with significantly less training data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-token prediction for speech LLMs

Scalable model-agnostic training framework

High-performance low-latency real-time generation
