A Variational Framework for Improving Naturalness in Generative Spoken Language Models

📅 2025-06-17

🤖 AI Summary

Generative spoken language models discretize speech into semantic tokens but neglect paralinguistic information such as prosody, so the speech they synthesize has limited naturalness. Existing pitch-augmentation approaches are constrained by hand-crafted feature engineering and incomplete paralinguistic representations. To address this, we propose an end-to-end variational framework that, for the first time, enables implicit, unsupervised modeling of continuous paralinguistic attributes (including intonation, rhythm, and emotion) without manual intervention. Our method jointly trains on semantic tokens and continuous latent variables within a variational autoencoder (VAE), automatically infusing multidimensional prosodic information during self-supervised speech–language joint modeling. Human evaluation demonstrates statistically significant improvements over baselines (p < 0.01). We publicly release code, pretrained models, and audio samples to support reproducibility.

📝 Abstract

The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at https://github.com/b04901014/vae-gslm.
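To make the idea of pairing discrete semantic tokens with learned continuous latents more concrete, the following is a minimal sketch of a generic per-frame VAE-style loss. It is not the authors' objective: the standard normal prior, the squared-error reconstruction term, the `beta` weight, and every function name here (`gaussian_kl`, `reparameterize`, `token_nll`, `frame_loss`) are assumptions made for illustration. The released repository should be consulted for the actual formulation.

```python
import math
import random


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Assumption: a fixed standard normal prior, chosen only for simplicity.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def token_nll(logits, target):
    """Negative log-likelihood of one discrete token under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def frame_loss(logits, target, mu, log_var, recon, frame, beta=1.0):
    """Hypothetical per-frame loss: token NLL + reconstruction error + beta * KL.

    `recon` stands in for a decoder output computed elsewhere from the
    semantic token and the sampled latent; `frame` is the acoustic target.
    """
    recon_err = 0.5 * sum((r - f) ** 2 for r, f in zip(recon, frame))
    return token_nll(logits, target) + recon_err + beta * gaussian_kl(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.2, -0.1], [-1.0, -0.5]   # toy encoder outputs
    z = reparameterize(mu, log_var, rng)       # continuous latent for this frame
    loss = frame_loss(
        logits=[1.0, 0.5, -0.3], target=0,     # toy next-token prediction
        mu=mu, log_var=log_var,
        recon=[0.1, 0.4], frame=[0.0, 0.5],    # toy decoder output and target
    )
    print("latent:", z)
    print("loss:", loss)
```

In this sketch the KL term is what keeps the latent from simply memorizing the frame, while the token term keeps the linguistic content tied to the semantic tokens. How the two are actually balanced in the paper is not stated in the abstract.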

Problem

Research questions and friction points this paper is trying to address.

Improving speech naturalness in generative spoken language models

Addressing prosodic information neglect in semantic speech tokens

Automating paralinguistic feature encoding without manual extraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational framework enhances speech naturalness

Automatically encodes continuous speech attributes

Eliminates manual feature extraction
