🤖 AI Summary
Large language models (LLMs) adapted to speech commonly suffer from a text-speech understanding gap—their linguistic comprehension lags significantly behind both pure-text LLMs and cascaded ASR-LLM systems.
Method: We propose SALAD, a framework for efficient modality alignment under extremely low-resource constraints, integrating active sample selection and cross-modal knowledge distillation. SALAD combines synthetic data augmentation with parameter-efficient fine-tuning, requiring only publicly available speech corpora—eliminating dependence on large-scale synthetic text or proprietary speech data.
Contribution/Results: Experiments show that SALAD matches state-of-the-art open-source LLMs in knowledge, language understanding, and reasoning tasks, while reducing speech training data requirements by over an order of magnitude. It effectively mitigates text capability forgetting and cross-modal misalignment. SALAD establishes a reproducible, cost-effective paradigm for low-resource speech-language joint modeling.
📝 Abstract
Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.