🤖 AI Summary
This work addresses a limitation of text-speech language models (TSLMs): cross-modal knowledge transfer is hindered by an abstraction-level mismatch between text and speech representations. We propose an inter-layer abstraction alignment mechanism: building on a pretrained text language model, we introduce learnable cross-modal projection layers and hierarchical adapters that align the abstraction levels of text and speech representations depth-wise, moving beyond coarse-grained vocabulary-expansion paradigms to a fine-grained, per-layer alignment. This enables effective cross-modal function reuse and joint modeling of text and speech representations. The resulting SmolTolk models achieve state-of-the-art or competitive performance on ASR, TTS, and cross-modal retrieval tasks while incurring only ~10% of the computational cost of existing SOTA TSLMs.
📝 Abstract
Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- aim to enable cross-modal knowledge transfer to overcome the scaling limitations of unimodal speech LMs. The predominant approach to TSLM training expands the vocabulary of a pre-trained text LM by appending new embeddings and linear projections for speech, followed by fine-tuning on speech data. We hypothesize that this method limits cross-modal transfer by neglecting feature compositionality, preventing text-learned functions from being fully leveraged at appropriate abstraction levels. To address this, we propose augmenting vocabulary expansion with modules that better align abstraction levels across layers. Our models, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute. Representation analyses and improved multimodal performance suggest our method enhances cross-modal transfer.
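To make the two ingredients concrete, here is a minimal numpy sketch of (a) the vocabulary-expansion baseline -- appending fresh speech-token embeddings to a pretrained text embedding table -- and (b) per-layer alignment modules, shown here as zero-initialized residual adapters applied to speech-token hidden states. All sizes, the residual-adapter form, and the zero initialization are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8           # hidden size (toy)
n_text_tokens = 100   # pretrained text vocabulary (assumed size)
n_speech_tokens = 50  # new discrete speech units appended (assumed size)

# Pretrained text embedding table.
text_embed = rng.normal(0, 0.02, size=(n_text_tokens, d_model))

# (a) Vocabulary expansion: append freshly initialized speech embeddings.
# (A matching output projection would be extended the same way.)
speech_embed = rng.normal(0, 0.02, size=(n_speech_tokens, d_model))
embed = np.vstack([text_embed, speech_embed])  # shape (150, 8)

# (b) Per-layer alignment module, sketched as a residual linear adapter.
# Zero initialization makes each adapter start as the identity, so the
# pretrained text pathway is untouched at the start of fine-tuning.
def adapter(h, W, b):
    return h + h @ W + b

n_layers = 4
adapters = [(np.zeros((d_model, d_model)), np.zeros(d_model))
            for _ in range(n_layers)]

# Toy forward pass over a mixed text/speech sequence (attention omitted);
# only speech-token states pass through the alignment adapters.
tokens = np.array([3, 120, 7, 101])   # ids >= 100 are speech tokens
h = embed[tokens]                     # fancy indexing copies the rows
is_speech = tokens >= n_text_tokens
for W, b in adapters:
    h[is_speech] = adapter(h[is_speech], W, b)

print(h.shape)  # (4, 8)
```

With zero-initialized adapters the forward pass is exactly the vocabulary-expansion baseline; training would then learn nonzero `W`, `b` per layer to align speech states with the text abstractions at that depth.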