🤖 AI Summary
To address the high computational overhead, cross-lingual interference, suboptimal training configurations, and poor scalability of joint training for multilingual speech recognition and translation, this paper proposes LoRS-Merging, a low-rank and sparse collaborative model-merging paradigm. It is the first framework to unify low-rank approximation, structured sparsity pruning, and parameter-space alignment within a single model-merging setting. By preserving language-specific structural essentials while suppressing cross-lingual interference, LoRS-Merging enables efficient, lossless fusion of monolingual models. Evaluated on multilingual speech-to-text (S2T) benchmarks, it consistently outperforms joint-training baselines, including Whisper, achieving a 32% inference speedup and a 41% memory reduction. Moreover, it supports plug-and-play language expansion without retraining.
📝 Abstract
Language diversity presents a significant challenge in speech-to-text (S2T) tasks such as automatic speech recognition and translation. Traditional multi-task training approaches address this by jointly optimizing multiple speech recognition and translation tasks across various languages. While models like Whisper, built on these strategies, demonstrate strong performance, they still suffer from high computational cost, language interference, suboptimal training configurations, and limited extensibility. To overcome these challenges, we introduce LoRS-Merging (low-rank and sparse model merging), a novel technique that efficiently integrates models trained on different languages or tasks while preserving performance and reducing computational overhead. LoRS-Merging combines low-rank and sparse pruning to retain essential structures while eliminating redundant parameters, mitigating language and task interference, and enhancing extensibility. Experimental results across a range of languages demonstrate that LoRS-Merging significantly outperforms conventional multilingual multi-task training baselines. Our findings suggest that model merging, particularly LoRS-Merging, is a scalable and effective complement to traditional multilingual training strategies for S2T applications.
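To make the idea concrete, here is a minimal sketch of the merging recipe the abstract describes: compute each monolingual model's weight delta against a shared base, keep only its low-rank structure (truncated SVD) and its largest-magnitude entries (sparse pruning), then sum the filtered deltas back into the base. The paper does not publish this exact pseudocode; the function names, the rank/sparsity hyperparameters, and the uniform averaging of deltas are illustrative assumptions, shown for a single weight matrix rather than a full model.

```python
import numpy as np

def low_rank(delta, rank):
    """Truncated-SVD approximation of a task vector (weight delta)."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

def sparse_prune(delta, keep_ratio):
    """Magnitude pruning: zero out all but the largest-magnitude entries."""
    k = max(1, int(delta.size * keep_ratio))
    thresh = np.sort(np.abs(delta).ravel())[-k]
    return np.where(np.abs(delta) >= thresh, delta, 0.0)

def lors_merge(base, finetuned_list, rank=4, keep_ratio=0.1, alpha=1.0):
    """Merge monolingual models (illustrative): low-rank + sparse-filtered
    deltas are averaged and added back onto the shared base weights."""
    merged = base.copy()
    for ft in finetuned_list:
        delta = ft - base                      # per-language task vector
        delta = sparse_prune(low_rank(delta, rank), keep_ratio)
        merged += alpha * delta / len(finetuned_list)
    return merged

# Toy usage: merge two "monolingual" perturbations of one base matrix.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 8))
finetuned = [base + 0.1 * rng.normal(size=(8, 8)) for _ in range(2)]
merged = lors_merge(base, finetuned, rank=2, keep_ratio=0.2)
```

In this reading, low-rank truncation keeps the dominant language-specific structure of each delta, while magnitude pruning discards the small, diffuse entries most likely to interfere across languages; adding a new language is then just one more filtered delta, which is why no retraining is needed.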