🤖 AI Summary
This work identifies and systematically evaluates two critical issues in speech large language models (Speech-LLMs): catastrophic forgetting, a significant degradation of textual capabilities after integrating the speech modality, and modality inequivalence, substantially lower performance on speech inputs than on text inputs. To address these, the authors propose the first dual-channel cross-modal knowledge distillation framework, which transfers knowledge from a pure-text teacher model to the Speech-LLM through two parallel pathways: text-to-text and speech-to-text. The method combines cross-modal alignment modeling with end-to-end joint training and supports both dialogue and audio understanding tasks. Experiments show substantial improvements in speech-input performance across multiple benchmarks while reducing textual capability degradation by up to 72%. This is claimed to be the first approach to achieve synergistic optimization: enhancing speech understanding without compromising pre-existing text comprehension and reasoning abilities.
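The dual-channel objective described above can be sketched as two distillation losses sharing one text teacher: one over the student's text-channel outputs, one over its speech-channel outputs. The sketch below is illustrative only; the function names, loss weights, and temperature are assumptions, not the paper's exact formulation, and a standard temperature-scaled KL distillation loss is used as a stand-in.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax, computed stably along the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as in standard knowledge distillation (Hinton-style)."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return T * T * kl.mean()

def dual_channel_loss(student_text_logits, student_speech_logits,
                      teacher_logits, w_text=0.5, w_speech=0.5):
    """Dual-channel objective: the text-to-text and speech-to-text pathways
    both distill from the same pure-text teacher. The channel weights
    w_text / w_speech are hypothetical hyperparameters."""
    return (w_text * distill_loss(student_text_logits, teacher_logits)
            + w_speech * distill_loss(student_speech_logits, teacher_logits))

# Toy usage: random logits standing in for model outputs over a 10-token vocab.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))
student_text = rng.normal(size=(4, 10))
student_speech = rng.normal(size=(4, 10))
loss = dual_channel_loss(student_text, student_speech, teacher)
```

Tying both channels to a single frozen text teacher is what lets the student improve on speech inputs while being pulled back toward the teacher's textual behavior, which is the mechanism behind the reported reduction in forgetting.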
📝 Abstract
In this work, we present the first systematic evaluation of catastrophic forgetting and modality inequivalence in speech large language models, showing that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual, and that performance decreases further with spoken queries. To address these challenges, we propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM. Extensive experiments on dialogue and audio understanding tasks validate the effectiveness of our approach in preserving textual knowledge, improving cross-modal alignment, and enhancing reasoning in speech-based interactions.