Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels

📅 2025-03-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of Text-Speech Language Models (TSLMs): cross-modal knowledge transfer is hindered by an abstraction-level mismatch between text and speech representations. The authors propose an inter-layer abstraction alignment mechanism: building on a pretrained text language model, they augment the standard vocabulary-expansion recipe with learnable modules that align the abstraction levels of text and speech representations depth-wise, going beyond the coarse-grained vocabulary-extension paradigm. This enables text-learned functions to be reused at the appropriate layers and supports joint modeling of text and speech. The resulting SmolTolk models rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute.
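The two ingredients described above, vocabulary expansion plus per-layer alignment modules, can be illustrated with a minimal numpy sketch. All shapes, names, and the bottleneck-adapter form are illustrative assumptions, not the paper's actual architecture; the frozen transformer layers are omitted so only the added pieces are visible.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16       # hidden size (hypothetical)
text_vocab = 100   # pretrained text vocabulary size
speech_vocab = 50  # new speech token vocabulary size
n_layers = 4

# Pretrained text embedding table (kept frozen in this sketch).
text_emb = rng.normal(size=(text_vocab, d_model))

# Vocabulary expansion: append freshly initialized rows for speech tokens,
# so token ids >= text_vocab index speech units.
speech_emb = rng.normal(size=(speech_vocab, d_model)) * 0.02
emb = np.vstack([text_emb, speech_emb])

def adapter(h, w_down, w_up):
    """Residual bottleneck adapter: h + ReLU(h @ W_down) @ W_up."""
    return h + np.maximum(h @ w_down, 0.0) @ w_up

# One small adapter per layer: the learnable module that would nudge
# speech features toward the abstraction level the text-trained layer
# at that depth expects (adapter form is an assumption here).
bottleneck = 4
adapters = [
    (rng.normal(size=(d_model, bottleneck)) * 0.02,
     rng.normal(size=(bottleneck, d_model)) * 0.02)
    for _ in range(n_layers)
]

def encode(token_ids):
    h = emb[token_ids]
    for w_down, w_up in adapters:
        # The frozen transformer layer would run here; only the
        # inserted per-layer adapter is shown.
        h = adapter(h, w_down, w_up)
    return h

tokens = np.array([1, 5, text_vocab + 3])  # mixed text and speech ids
out = encode(tokens)
print(out.shape)  # (3, 16)
```

The point of the depth-wise placement is that each adapter can specialize to its layer, rather than forcing a single input-level projection (plain vocabulary expansion) to bridge the modality gap for every abstraction level at once.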

📝 Abstract
Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- aim to enable cross-modal knowledge transfer to overcome the scaling limitations of unimodal speech LMs. The predominant approach to TSLM training expands the vocabulary of a pre-trained text LM by appending new embeddings and linear projections for speech, followed by fine-tuning on speech data. We hypothesize that this method limits cross-modal transfer by neglecting feature compositionality, preventing text-learned functions from being fully leveraged at appropriate abstraction levels. To address this, we propose augmenting vocabulary expansion with modules that better align abstraction levels across layers. Our models, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute. Representation analyses and improved multimodal performance suggest our method enhances cross-modal transfer.
Problem

Research questions and friction points this paper is trying to address.

Improve cross-modal transfer in Text-Speech Language Models.
Address limitations of unimodal speech language models.
Enhance feature compositionality and abstraction level alignment.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augments vocabulary expansion with abstraction alignment
Enhances cross-modal transfer via feature compositionality
Achieves state-of-the-art performance with less compute