🤖 AI Summary
To address high generation latency and the trade-off between audio fidelity and semantic quality in speech large language models (LLMs) during real-time interaction, we introduce the VocalNet-1B/8B family and an integrated training-inference framework. The core methodological innovation is the first adoption of multi-token prediction (MTP) in speech LLMs. Replacing conventional single-step next-token prediction, MTP jointly models speech and text representations while improving generation speed, audio quality, and semantic accuracy. The framework is model-agnostic and scalable, integrating speech-text joint representation learning with efficient inference optimization. Experiments show that VocalNet outperforms mainstream omnimodal LLMs while using significantly less training data, and that it substantially outperforms existing open-source speech LLMs. To foster reproducibility and community advancement, we will fully open-source all models, code, datasets, and the training framework.
📝 Abstract
Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We propose VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework for real-time voice interaction. Departing from conventional next-token prediction (NTP), we introduce multi-token prediction (MTP), a novel approach optimized for speech LLMs that simultaneously improves generation speed and quality. Experiments show that VocalNet outperforms mainstream Omni LLMs despite using significantly less training data, while also surpassing existing open-source speech LLMs by a substantial margin. To support reproducibility and community advancement, we will open-source all model weights, inference code, training data, and framework implementations upon publication.
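The difference between NTP and MTP can be illustrated with a minimal sketch of how training targets are built for each objective. This is not the VocalNet implementation, which the abstract does not describe. The function names (`ntp_targets`, `mtp_targets`, `decoding_steps`), the horizon parameter `k`, and the padding convention are all assumptions made for illustration. The step count is an idealized upper bound on the speedup, assuming every decoding step emits all `k` tokens.

```python
import math
from typing import List, Optional


def ntp_targets(tokens: List[int]) -> List[int]:
    """Next-token prediction: position t is trained to predict tokens[t + 1]."""
    return tokens[1:]


def mtp_targets(
    tokens: List[int], k: int, pad: Optional[int] = None
) -> List[List[Optional[int]]]:
    """Multi-token prediction (illustrative): position t is trained to predict
    the next k tokens, tokens[t + 1] .. tokens[t + k].

    Future positions that fall past the end of the sequence are filled with
    `pad`, which a training loss would typically ignore.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    rows = []
    for t in range(len(tokens) - 1):
        row = [
            tokens[t + j] if t + j < len(tokens) else pad
            for j in range(1, k + 1)
        ]
        rows.append(row)
    return rows


def decoding_steps(num_tokens: int, k: int) -> int:
    """Idealized number of decoding steps when each step emits k tokens."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return math.ceil(num_tokens / k)


if __name__ == "__main__":
    speech_tokens = [10, 11, 12, 13]  # hypothetical speech token ids
    print(ntp_targets(speech_tokens))
    print(mtp_targets(speech_tokens, k=2))
    print(decoding_steps(100, k=1), decoding_steps(100, k=4))
```

With `k = 1` the MTP targets reduce to the NTP targets, so NTP is the special case of a one-token horizon. The latency benefit in this sketch comes only from emitting several speech tokens per step; any quality improvement depends on model and training details that are not covered here.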