Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision

πŸ“… 2026-03-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses language interference in multilingual speech large language models (LLMs) trained solely on ASR-labeled data via distillation, where a shared projection layer hinders effective instruction following. To mitigate this, the authors propose a language-aware distillation framework that introduces a learnable query bank and a gating network to dynamically select or blend language-specific query tokens. These tokens are then processed by a Q-Former projector, enabling instruction-following training under pure ASR supervision. The proposed method achieves a 14% improvement over the baseline on instruction-following tasks and outperforms existing speech LLMs by 32% on Audio-MLQA, a newly constructed multilingual spoken question-answering benchmark.
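The summary describes a gating network that blends language-specific query tokens from a learnable bank before they reach the Q-Former. A minimal NumPy sketch of that blending step is below; all shapes, names (`mix_queries`, `gate_w`), and the mean-pooling choice are assumptions for illustration, not details from the paper:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def mix_queries(query_bank, gate_w, speech_feats):
    """Blend language-specific query tokens with a soft gate.

    query_bank:   (num_langs, num_queries, dim) learnable bank
    gate_w:       (num_langs, dim) gating-network weights
    speech_feats: (frames, dim) speech encoder output for one utterance
    Returns:      (num_queries, dim) blended query tokens for the Q-Former
    """
    pooled = speech_feats.mean(axis=0)           # (dim,) utterance summary
    gate = softmax(gate_w @ pooled)              # (num_langs,) language weights
    # Convex combination of per-language query tokens
    return np.einsum("l,lqd->qd", gate, query_bank)

rng = np.random.default_rng(0)
bank = rng.normal(size=(3, 8, 16))   # 3 languages, 8 query tokens, dim 16
W = rng.normal(size=(3, 16))
feats = rng.normal(size=(50, 16))    # 50 frames of encoder features
queries = mix_queries(bank, W, feats)
print(queries.shape)  # (8, 16)
```

A hard selection variant would replace the softmax with an argmax over languages; the soft blend shown here keeps the gate differentiable for end-to-end training.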

πŸ“ Abstract
Speech Large Language Models (LLMs) that understand and follow instructions in many languages are useful for real-world interaction, but they are difficult to train with supervised fine-tuning, which requires large, task-specific speech corpora. While recent distillation-based approaches train performant English-only Speech LLMs from annotated ASR data alone by aligning text and speech through a lightweight projector, these models underperform when scaled to multilingual settings due to language interference in the shared projector. We address this by introducing language-aware distillation with a query bank and a gating network that selects or mixes query tokens for a Q-Former projector. Our approach yields gains of 14% over matched multilingual distillation baselines on instruction following. We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with high-quality TTS questions. Our best model improves over existing Speech LLM baselines by 32% on Audio-MLQA.
Problem

Research questions and friction points this paper is trying to address.

multilingual
speech LLMs
language interference
instruction-following
ASR-only supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

language-aware distillation
multilingual speech LLM
Q-Former projector
query bank
Audio-MLQA
Shreyas Gopal
College of Computing and Data Science, Nanyang Technological University, Singapore
Donghang Wu
College of Computing and Data Science, Nanyang Technological University, Singapore; AI Singapore, National University of Singapore
Ashutosh Anshul
College of Computing and Data Science, Nanyang Technological University, Singapore
Yeo Yue Heng
College of Computing and Data Science, Nanyang Technological University, Singapore; Institute for Infocomm Research (I2R), A*STAR, Singapore
Yizhou Peng
College of Computing and Data Science, Nanyang Technological University, Singapore
Haoyang Li
PhD student, Nanyang Technological University
Speech Synthesis, Speech Enhancement, Automatic Speech Recognition
Hexin Liu
Nanyang Technological University
Speech recognition, language identification
Eng Siong Chng
College of Computing and Data Science, Nanyang Technological University, Singapore