🤖 AI Summary
This work addresses an inefficiency in current large language model (LLM)-based automatic speech recognition (ASR) systems: training a separate connector for each language disregards phylogenetic relationships among languages, leading to parameter redundancy and limited generalization. To overcome this, the study is the first to incorporate language-family information into the design of LLM-ASR connectors. It proposes a lightweight connector, situated between a frozen speech encoder and a pretrained LLM and shared across all languages in a family, enabling knowledge transfer within that family. The method substantially reduces parameter count while improving cross-lingual recognition performance on two multilingual LLMs and two real-world speech corpora, achieving both deployment efficiency and stronger generalization.
📝 Abstract
Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight connector. Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. Our results show that family-based connectors reduce parameter count while improving generalization across domains, offering a practical and scalable strategy for multilingual ASR deployment.
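The connector-sharing idea can be illustrated with a minimal sketch. All names below (the language-to-family map, `Connector`, `FamilySharedConnectors`) are hypothetical and not from the paper; the point is only that languages in the same family reuse one set of connector parameters, so the number of trained connectors scales with families rather than languages.

```python
import numpy as np

# Illustrative (assumed) language-to-family map; the paper's actual
# family grouping may differ.
LANG_TO_FAMILY = {
    "es": "romance", "it": "romance", "pt": "romance",
    "de": "germanic", "nl": "germanic",
}

class Connector:
    """Minimal linear projection from encoder dim to LLM embedding dim."""
    def __init__(self, enc_dim, llm_dim, rng):
        self.W = rng.standard_normal((enc_dim, llm_dim)) * 0.01

    def __call__(self, speech_features):
        # speech_features: (frames, enc_dim) -> (frames, llm_dim)
        return speech_features @ self.W

class FamilySharedConnectors:
    """Lazily builds one connector per language family; languages in
    the same family share the same parameters."""
    def __init__(self, enc_dim=8, llm_dim=16, seed=0):
        self.enc_dim, self.llm_dim = enc_dim, llm_dim
        self.rng = np.random.default_rng(seed)
        self._by_family = {}

    def get(self, lang):
        fam = LANG_TO_FAMILY[lang]
        if fam not in self._by_family:
            self._by_family[fam] = Connector(self.enc_dim, self.llm_dim, self.rng)
        return self._by_family[fam]

pool = FamilySharedConnectors()
print(pool.get("es") is pool.get("it"))  # same family -> shared connector
print(pool.get("es") is pool.get("de"))  # different family -> separate
print(len(pool._by_family))              # connectors created: 2, not 4
```

With five languages in two families, only two connectors are instantiated, which is the parameter saving the abstract describes; in a real system each connector would be trained while the speech encoder and LLM stay frozen.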