Optimizing Speech Language Models for Acoustic Consistency

📅 2025-09-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses acoustic inconsistency in speech language models during speech generation under variable acoustic conditions—including speaker identity, gender, emotion, and environment—without modifying the tokenizer or inference architecture. The proposed lightweight optimization method initializes speech tokens from self-supervised features to strengthen semantic–acoustic coupling, and adds a sparse alignment loss ("thinning loss"), jointly optimized with auxiliary objectives, to enhance both acoustic stability and semantic fidelity. The approach is validated on two model paradigms: speech-only and interleaved text–speech models. Across three model scales, the method consistently improves cross-speaker, cross-emotion, and cross-environment acoustic consistency. Notably, the speech-only variant outperforms larger baseline systems with significantly fewer parameters, while the interleaved variant further strengthens semantic–acoustic alignment and syntactic representation capability.
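The self-supervised feature initialization described above can be sketched as seeding the LM's speech-token embedding table from SSL feature centroids. The function and setup below (k-means-style centroids, a random projection to the LM width) are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def init_speech_embeddings(ssl_centroids: np.ndarray, d_model: int,
                           seed: int = 0) -> np.ndarray:
    """Build an initial speech-token embedding table from self-supervised
    feature centroids (one centroid per discrete speech token).

    ssl_centroids: (vocab_size, ssl_dim) array, e.g. k-means centroids of
    HuBERT-style features (hypothetical setup). Returns a
    (vocab_size, d_model) embedding matrix.
    """
    vocab_size, ssl_dim = ssl_centroids.shape
    rng = np.random.default_rng(seed)
    # Random linear projection from SSL feature space into the LM width,
    # so each token's embedding starts near its acoustic/semantic centroid.
    proj = rng.normal(0.0, ssl_dim ** -0.5, size=(ssl_dim, d_model))
    return ssl_centroids @ proj
```

The intuition is that tokens with similar SSL features start with similar embeddings, giving the LM a semantic–acoustic prior before any gradient updates.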

📝 Abstract
We study speech language models that incorporate semantic initialization and planning losses to achieve robust and consistent generation. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives that target robustness and content planning. We train three models: a 0.7B speech-only model, a 1.0B speech-only model, and a 1.0B interleaved model with both text and speech. Acoustic studies show that the speech-only models achieve the highest consistency across speaker, gender, sentiment, room, and background factors, surpassing larger systems. Interleaving improves lexical and syntactic probes and semantic–acoustic alignment but reduces consistency. Linear probes show that our initialization biases the model toward content structure while trading off prosody detail. These results show that LM-side design and training mix control the balance between acoustic stability and semantic grounding without changes to the tokenizer or runtime architecture. A demo and model weights are available for exploration.
Problem

Research questions and friction points this paper is trying to address.

Achieving robust acoustic consistency in speech generation models
Balancing semantic grounding with acoustic stability in LM training
Initializing speech tokens to support content-structure planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Initializes speech tokens with self-supervised features
Uses thinning and auxiliary objectives for robustness
Applies alignment loss for semantic–acoustic consistency
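The combination of thinning, alignment, and auxiliary objectives listed above can be sketched as a joint training loss. The sampling scheme, function names, and weights below are illustrative assumptions; the paper's exact formulation may differ:

```python
import numpy as np

def thinning_alignment_loss(hidden: np.ndarray, ssl_targets: np.ndarray,
                            keep_prob: float = 0.1, seed: int = 0) -> float:
    """Sparse ("thinned") alignment loss: match LM hidden states to
    self-supervised targets at only a random subset of positions, keeping
    the regularizer cheap and preventing it from dominating training.

    hidden, ssl_targets: (seq_len, dim) arrays, assumed already projected
    into a shared space (hypothetical setup).
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(hidden.shape[0]) < keep_prob  # keep ~keep_prob of positions
    if not mask.any():
        return 0.0
    diff = hidden[mask] - ssl_targets[mask]
    return float(np.mean(diff ** 2))  # mean squared alignment error

def total_loss(lm_loss: float, align_loss: float, aux_loss: float,
               lam_align: float = 0.5, lam_aux: float = 0.1) -> float:
    # Joint objective: next-token loss plus weighted thinning-alignment
    # and auxiliary terms (weights here are placeholders).
    return lm_loss + lam_align * align_loss + lam_aux * aux_loss
```

Because the alignment term is evaluated on only a thin subset of positions, it nudges hidden states toward the SSL feature space without overriding the language-modeling objective.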