🤖 AI Summary
Existing textless spoken language models (SLMs) model only discrete semantic tokens and rely on external vocoders for speech synthesis, losing acoustic context and offering limited control over fine-grained prosodic detail. This work proposes an end-to-end textless SLM framework that jointly models discrete semantic tokens and continuous acoustic frames. Specifically, semantic embeddings guide a flow-matching process that predicts high-fidelity acoustic features, while a multi-step semantic prediction mechanism preserves linguistic coherence. The model is trained end to end without textual supervision or an external vocoder and supports prompted generation. Experiments show language-modeling likelihood competitive with state-of-the-art (SOTA) textless SLMs, along with notably better acoustic detail and speech naturalness, unifying semantic and acoustic modeling within a single, vocoder-free architecture.
📝 Abstract
Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens together with a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach matches existing models on linguistic likelihood benchmarks while providing better acoustic detail in prompted generation.
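To make the core mechanism concrete, the sketch below shows a conditional flow-matching training loss of the kind the abstract describes: a model predicts the velocity of a straight path from noise to a target acoustic frame, conditioned on semantic embeddings. This is a minimal illustration, not the paper's implementation; the rectified-flow (linear-interpolation) form, the toy linear model, and all dimensions (80-dim frames, 16-dim conditioning) are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(predict_velocity, x1, cond, rng):
    """Conditional flow-matching loss (rectified-flow form, an assumed variant).

    x1   : target continuous acoustic frames, shape (batch, dim)
    cond : conditioning semantic-token embeddings, shape (batch, cdim)
    """
    batch, dim = x1.shape
    x0 = rng.standard_normal((batch, dim))   # noise endpoint of the path
    t = rng.uniform(size=(batch, 1))         # interpolation time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1             # point on the straight noise->data path
    target_v = x1 - x0                       # constant velocity along that path
    pred_v = predict_velocity(xt, t, cond)   # model's conditional velocity estimate
    return np.mean((pred_v - target_v) ** 2)

# Toy "model": a fixed linear map from (x_t, t, cond) to a velocity.
W = rng.standard_normal((80 + 1 + 16, 80)) * 0.01
def toy_model(xt, t, cond):
    return np.concatenate([xt, t, cond], axis=1) @ W

x1 = rng.standard_normal((4, 80))    # e.g. mel-like acoustic frames (assumed 80-dim)
cond = rng.standard_normal((4, 16))  # semantic embeddings (assumed 16-dim)
loss = flow_matching_loss(toy_model, x1, cond, rng)
print(float(loss))  # a non-negative scalar MSE
```

At inference time, the learned velocity field would be integrated from noise to produce an acoustic frame directly, which is what removes the need for an external vocoder.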