🤖 AI Summary
To address low automatic speech recognition (ASR) accuracy in multilingual dialogue scenarios, this paper proposes an Encoder-Adapter-LLM architecture. It introduces a lightweight adapter module that efficiently aligns a speech encoder with a multilingual large language model (LLM), facilitating cross-lingual knowledge transfer. Combined with domain adaptation and a multi-stage training strategy, the framework jointly optimizes speech representation learning and linguistic understanding. Trained on large-scale multilingual audio data, the model achieves competitive word error rates (WER) on both the development and test sets of Task 1 in the INTERSPEECH 2025 MLC-SLM Challenge, securing second place. Empirical results demonstrate its effectiveness and strong generalization capability for complex, code-switched, multilingual conversational ASR.
📝 Abstract
This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture. This framework harnesses the powerful reasoning capabilities of text-based large language models while incorporating domain-specific adaptations. To further enhance multilingual recognition performance, we adopted a meticulously designed multi-stage training strategy leveraging extensive multilingual audio datasets. Experimental results demonstrate that our approach achieves competitive word error rate (WER) performance on both the development and test sets, securing second place in the challenge ranking.
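The adapter's role in an encoder-adapter-LLM stack can be illustrated with a minimal sketch: it shortens the speech encoder's frame sequence and projects it into the LLM's embedding space so the frames can be consumed as pseudo-tokens. All dimensions, the frame-stacking factor, and the two-layer design below are illustrative assumptions, not the authors' actual configuration.

```python
import numpy as np

# Illustrative adapter between a speech encoder and an LLM.
# ENC_DIM, LLM_DIM, and STACK are assumed values for demonstration.
ENC_DIM = 1280   # assumed speech-encoder output dimension
LLM_DIM = 4096   # assumed LLM embedding dimension
STACK = 4        # assumed frame-stacking factor for sequence-length reduction

def adapter(frames: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Map encoder frames (T, ENC_DIM) to LLM pseudo-tokens (T // STACK, LLM_DIM).

    Concatenates every STACK consecutive frames, then applies a
    two-layer projection (linear + ReLU + linear) into the LLM space.
    """
    T = (frames.shape[0] // STACK) * STACK              # drop trailing frames
    stacked = frames[:T].reshape(-1, STACK * ENC_DIM)   # (T//STACK, STACK*ENC_DIM)
    hidden = np.maximum(stacked @ w1, 0.0)              # linear + ReLU
    return hidden @ w2                                  # project to LLM_DIM

# Randomly initialized weights stand in for the trained adapter.
rng = np.random.default_rng(0)
w1 = rng.standard_normal((STACK * ENC_DIM, 2048)) * 0.01
w2 = rng.standard_normal((2048, LLM_DIM)) * 0.01

encoder_out = rng.standard_normal((100, ENC_DIM))   # 100 frames of speech features
speech_tokens = adapter(encoder_out, w1, w2)
print(speech_tokens.shape)  # (25, 4096): 4x shorter sequence in LLM embedding space
```

In a full system, these projected embeddings would be concatenated with the text-prompt embeddings before the LLM forward pass; during the multi-stage training described above, typically only the adapter (and optionally the encoder or LLM, per stage) is updated.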