🤖 AI Summary
This work identifies and systematically evaluates two critical issues in speech large language models (Speech-LLMs): catastrophic forgetting, a significant degradation of textual capabilities after integrating the speech modality, and modality inequivalence, substantially lower performance on speech inputs than on text inputs. To address these, the authors propose the first dual-channel cross-modal knowledge distillation framework, which transfers knowledge from a pure-text teacher model to the Speech-LLM through two parallel pathways: text-to-text and speech-to-text. The method combines cross-modal alignment modeling with end-to-end joint training and supports both dialogue and audio understanding tasks. Experiments show substantial improvements in speech-input performance across multiple benchmarks while reducing textual capability degradation by up to 72%. This is claimed to be the first approach to achieve synergistic optimization: enhancing speech understanding without compromising pre-existing text comprehension and reasoning abilities.
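The dual-channel objective described above can be sketched as two distillation losses sharing one text teacher: one over the student's text-channel outputs, one over its speech-channel outputs. The sketch below is illustrative only; the function names, loss weights, and temperature are assumptions, not the paper's exact formulation, and a standard temperature-scaled KL distillation loss is used as a stand-in.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax, computed stably along the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as in standard knowledge distillation (Hinton-style)."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return T * T * kl.mean()

def dual_channel_loss(student_text_logits, student_speech_logits,
                      teacher_logits, w_text=0.5, w_speech=0.5):
    """Dual-channel objective: the text-to-text and speech-to-text pathways
    both distill from the same pure-text teacher. The channel weights
    w_text / w_speech are hypothetical hyperparameters."""
    return (w_text * distill_loss(student_text_logits, teacher_logits)
            + w_speech * distill_loss(student_speech_logits, teacher_logits))

# Toy usage: random logits standing in for model outputs over a 10-token vocab.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))
student_text = rng.normal(size=(4, 10))
student_speech = rng.normal(size=(4, 10))
loss = dual_channel_loss(student_text, student_speech, teacher)
```

Tying both channels to a single frozen text teacher is what lets the student improve on speech inputs while being pulled back toward the teacher's textual behavior, which is the mechanism behind the reported reduction in forgetting.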
📝 Abstract
In this work, we present the first systematic evaluation of catastrophic forgetting and modality inequivalence in speech large language models, showing that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual, and that performance decreases further with spoken queries. To address these challenges, we propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM. Extensive experiments on dialogue and audio understanding tasks validate the effectiveness of our approach in preserving textual knowledge, improving cross-modal alignment, and enhancing reasoning in speech-based interactions.