VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

📅 2025-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high generation latency and the trade-off between audio fidelity and semantic quality in speech large language models (LLMs) during real-time interaction, this paper introduces the VocalNet-1B/8B family and an integrated training-inference framework. The core methodological innovation is the first adoption of multi-token prediction (MTP) in speech LLMs, replacing conventional single-step next-token prediction. MTP jointly models speech and text representations while improving generation speed, audio quality, and semantic accuracy. The framework is model-agnostic and scalable, integrating speech-text joint representation learning with efficient inference optimization. Experiments demonstrate that VocalNet outperforms mainstream omnimodal LLMs while using significantly less training data, and substantially surpasses existing open-source speech LLMs. To foster reproducibility and community advancement, the authors will fully open-source all models, code, datasets, and the training framework.
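The core idea behind MTP can be illustrated with a minimal sketch: instead of a single next-token head, k parallel output heads predict the tokens at offsets t+1 through t+k from the same hidden state, so each autoregressive step emits k tokens. All names and shapes below are hypothetical for illustration; the paper's actual VocalNet architecture may differ.

```python
import numpy as np

def mtp_step(h, W_heads):
    """One multi-token prediction (MTP) decoding step.

    h:       (hidden_dim,) final hidden state at position t
    W_heads: (k, vocab_size, hidden_dim) one projection per offset

    Returns the k greedily-decoded token ids for positions t+1..t+k,
    versus the single token a next-token-prediction (NTP) step yields.
    """
    logits = W_heads @ h            # (k, vocab_size)
    return logits.argmax(axis=-1)   # (k,) predicted token ids

# Toy example with hypothetical dimensions.
rng = np.random.default_rng(0)
hidden_dim, vocab_size, k = 16, 100, 4
W = rng.standard_normal((k, vocab_size, hidden_dim))
h = rng.standard_normal(hidden_dim)
tokens = mtp_step(h, W)  # 4 speech tokens emitted in one decode step
```

Since each step emits k tokens, the number of autoregressive passes drops by roughly a factor of k, which is the source of the latency reduction claimed for speech generation.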

📝 Abstract
Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We propose VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework for real-time voice interaction. Departing from the conventional next-token prediction (NTP), we introduce multi-token prediction (MTP), a novel approach optimized for speech LLMs that simultaneously improves generation speed and quality. Experiments show that VocalNet outperforms mainstream Omni LLMs despite using significantly less training data, while also surpassing existing open-source speech LLMs by a substantial margin. To support reproducibility and community advancement, we will open-source all model weights, inference code, training data, and framework implementations upon publication.
Problem

Research questions and friction points this paper is trying to address.

Enhancing speech generation speed and quality
Developing scalable speech LLMs for real-time interaction
Reducing training data dependency for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-token prediction for speech LLMs
Scalable model-agnostic training framework
High-performance low-latency real-time generation
Yuhao Wang
Shanghai Jiao Tong University
Heyang Liu
Shanghai Jiao Tong University
Ziyang Cheng
University of Electronic Science and Technology of China
Ronghua Wu
Ant Group
Qunshan Gu
Ant Group
Yanfeng Wang
Shanghai Jiao Tong University
Yu Wang
Shanghai Jiao Tong University