Closing the Gap Between Text and Speech Understanding in LLMs

📅 2025-10-15

📈 Citations: 0

✨ Influential: 0

career value

205K/year

🤖 AI Summary

Large language models (LLMs) adapted to speech commonly suffer from a text-speech understanding gap—their linguistic comprehension lags significantly behind both pure-text LLMs and cascaded ASR-LLM systems. Method: We propose SALAD, a framework for efficient modality alignment under extremely low-resource constraints, integrating active sample selection and cross-modal knowledge distillation. SALAD combines synthetic data augmentation with parameter-efficient fine-tuning, requiring only publicly available speech corpora—eliminating dependence on large-scale synthetic text or proprietary speech data. Contribution/Results: Experiments show that SALAD matches state-of-the-art open-source LLMs in knowledge, language understanding, and reasoning tasks, while reducing speech training data requirements by over an order of magnitude. It effectively mitigates text capability forgetting and cross-modal misalignment. SALAD establishes a reproducible, cost-effective paradigm for low-resource speech-language joint modeling.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.

Problem

Research questions and friction points this paper is trying to address.

Addressing performance gap between speech and text understanding in LLMs

Mitigating forgetting of text capabilities during speech adaptation

Improving cross-modal alignment with data-efficient distillation methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines cross-modal distillation with targeted synthetic data

Uses active selection to improve alignment and mitigate forgetting

Trains on significantly less speech data from public corpora

🔎 Similar Papers

SSR: Alignment-Aware Modality Connector for Speech Language Models