Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling

📅 2026-04-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the significant degradation of textual capabilities often observed when adapting pretrained large language models (LLMs) to speech-language modeling via continued pretraining. To mitigate this trade-off, the authors propose Multimodal Depth Upscaling: the original LLM backbone is frozen, a small number of additional Transformer layers are inserted, and only the inserted layers are trained on speech data, with E-Branchformer blocks, an architecture designed for automatic speech recognition (ASR), used as the inserted layers. Applied to SmolLM2-1.7B, this approach uses about 60% fewer trainable parameters than full fine-tuning yet matches or surpasses it on ASR, while reducing textual capability degradation by over 75%. The method thus balances strong speech understanding with robust retention of the model's original text-based competencies.
πŸ“ Abstract
Adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) via continual pretraining on speech data is promising, but often degrades the original text capabilities. We propose Multimodal Depth Upscaling, an extension of an emerging strategy in continual LLM pre-training, where new transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data. Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English Automatic Speech Recognition (ASR) data show that depth up-scaling achieves ASR comparable to full fine-tuning while causing far less text degradation than both full fine-tuning and Low-Rank Adaptation (LoRA). We further show that incorporating E-Branchformer, an architecture designed for speech recognition, as the inserted layers achieves ASR that matches or surpasses full fine-tuning on the larger model while reducing text degradation by over 75% with 60% fewer trainable parameters.
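The core idea of depth up-scaling, freezing the pretrained layer stack and interleaving a few fresh trainable layers, can be illustrated with a toy sketch. This is not the authors' implementation: the layer sizes, insertion interval, and `Layer`/`depth_upscale` names are illustrative assumptions, and a real Speech LM would use actual Transformer or E-Branchformer blocks with gradient masking.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """Stand-in for one Transformer block (hypothetical, for illustration)."""
    name: str
    n_params: int
    trainable: bool = True

def depth_upscale(base_layers, insert_every, new_layer_params):
    """Freeze every base layer; insert one fresh trainable layer
    after each group of `insert_every` frozen layers."""
    upscaled = []
    for i, layer in enumerate(base_layers, start=1):
        layer.trainable = False  # frozen text-LLM backbone
        upscaled.append(layer)
        if i % insert_every == 0:
            # newly inserted block, the only part updated on speech data
            upscaled.append(Layer(f"inserted_{i}", new_layer_params))
    return upscaled

base = [Layer(f"base_{i}", 1000) for i in range(8)]
model = depth_upscale(base, insert_every=4, new_layer_params=1000)

trainable = sum(l.n_params for l in model if l.trainable)
total = sum(l.n_params for l in model)
# Only the 2 inserted layers carry gradients: 2000 of 10000 params here.
```

Because the backbone parameters never change, the original text capabilities are preserved exactly in the frozen weights; only the inserted layers adapt to speech.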
Problem

Research questions and friction points this paper is trying to address.

Speech Language Models
Continual Pretraining
Text Degradation
Multimodal Adaptation
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Depth Upscaling
Speech Language Models
E-Branchformer
Continual Pretraining
Parameter-Efficient Adaptation