🤖 AI Summary
This work addresses acoustic inconsistency in speech language models during speech generation under variable acoustic conditions (speaker identity, gender, emotion, and environment) without modifying the tokenizer or inference architecture. Our lightweight optimization method introduces self-supervised feature initialization to strengthen semantic-acoustic coupling and proposes a sparse alignment loss ("thinning loss") jointly optimized with auxiliary objectives to enhance both acoustic stability and semantic fidelity. We validate the approach on two model paradigms: speech-only and interleaved text-speech models. Across three model scales, our method consistently improves cross-speaker, cross-emotion, and cross-environment acoustic consistency. Notably, the speech-only variant outperforms larger baseline systems while using significantly fewer parameters; the interleaved variant further strengthens semantic-acoustic alignment and syntactic representation capability.
📄 Abstract
We study speech language models that incorporate semantic initialization and planning losses to achieve robust and consistent generation. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives that target robustness and content planning. We train three models: a 0.7B speech-only model, a 1.0B speech-only model, and a 1.0B interleaved model with both text and speech. Acoustic studies show that the speech-only models achieve the highest consistency across speaker, gender, sentiment, room, and background factors, surpassing larger systems. Interleaving improves lexical and syntactic probes and semantic-acoustic alignment but reduces consistency. Linear probes show that our initialization biases the model toward content structure while trading off prosody detail. These results show that LM-side design and training mix control the balance between acoustic stability and semantic grounding without changes to the tokenizer or runtime architecture. A demo and model weights are available for exploration.
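The two core ideas, initializing speech-token embeddings from self-supervised (SSL) features and training with a joint objective that adds alignment and thinning terms to the LM loss, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names, the per-token mean-pooling scheme, and the loss weights are hypothetical, not the paper's implementation.

```python
import numpy as np

def init_embeddings_from_ssl(ssl_feats, token_ids, vocab_size):
    """Initialize each speech token's embedding as the mean SSL feature
    of the frames assigned to that token; tokens never observed in the
    data keep a small random initialization.

    ssl_feats: (num_frames, dim) self-supervised features.
    token_ids: (num_frames,) discrete speech-token id per frame.
    """
    dim = ssl_feats.shape[1]
    rng = np.random.default_rng(0)
    table = rng.normal(scale=0.02, size=(vocab_size, dim))
    for tok in range(vocab_size):
        mask = token_ids == tok
        if mask.any():
            table[tok] = ssl_feats[mask].mean(axis=0)
    return table

def total_loss(lm_loss, align_loss, thin_loss, w_align=0.1, w_thin=0.05):
    """Joint objective: LM loss plus weighted alignment and thinning
    terms (weights are illustrative assumptions)."""
    return lm_loss + w_align * align_loss + w_thin * thin_loss
```

In this sketch the tokenizer itself is untouched: only the LM-side embedding initialization and the training objective change, which is the design constraint the abstract emphasizes.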