Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low automatic speech recognition (ASR) accuracy in multilingual dialogue scenarios, this paper proposes an Encoder-Adapter-LLM architecture. It introduces a lightweight adapter module to enable efficient alignment between a speech encoder and a multilingual large language model (LLM), facilitating cross-lingual knowledge transfer. Integrated with domain adaptation and multi-stage training strategies, the framework jointly optimizes speech representation learning and linguistic understanding. Trained on large-scale multilingual audio data, the model achieves state-of-the-art word error rates (WER) on both development and test sets in Task 1 of the INTERSPEECH 2025 MLC-SLM Challenge, securing second place. Empirical results demonstrate its effectiveness and strong generalization capability for complex, code-switched, multilingual conversational ASR.
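The summary describes a lightweight adapter that bridges the speech encoder's output space and the LLM's embedding space. A minimal sketch of that idea is below, assuming a common adapter design (frame stacking for temporal downsampling plus a linear projection); the dimensions and the exact adapter structure are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions — the paper's actual encoder/LLM sizes are not given here.
ENC_DIM, LLM_DIM, STACK = 512, 1024, 4  # stack 4 encoder frames into 1 LLM token

def adapter(frames: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Lightweight adapter: downsample by stacking adjacent frames, then
    linearly project into the LLM embedding space. This is a generic
    encoder-adapter-LLM pattern; the submitted system's adapter may differ."""
    t = (frames.shape[0] // STACK) * STACK          # drop trailing frames
    stacked = frames[:t].reshape(-1, STACK * ENC_DIM)  # (T//STACK, STACK*ENC_DIM)
    return stacked @ w + b                             # (T//STACK, LLM_DIM)

w = rng.standard_normal((STACK * ENC_DIM, LLM_DIM)) * 0.01
b = np.zeros(LLM_DIM)

speech_feats = rng.standard_normal((100, ENC_DIM))  # 100 encoder output frames
llm_tokens = adapter(speech_feats, w, b)
print(llm_tokens.shape)  # (25, 1024)
```

The downsampling step shortens the speech sequence so it is closer to text-token granularity before the LLM consumes it, which is one common way such adapters keep the alignment efficient.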

📝 Abstract
This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture. This framework harnesses the powerful reasoning capabilities of text-based large language models while incorporating domain-specific adaptations. To further enhance multilingual recognition performance, we adopted a meticulously designed multi-stage training strategy leveraging extensive multilingual audio datasets. Experimental results demonstrate that our approach achieves competitive Word Error Rate (WER) performance on both dev and test sets, obtaining second place in the challenge ranking.
Problem

Research questions and friction points this paper is trying to address.

Optimizing multilingual speech recognition accuracy in conversations
Leveraging LLM with domain adaptations for speech recognition
Enhancing performance via multi-stage multilingual audio training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encoder-adapter-LLM architecture for multilingual speech recognition
Multi-stage training strategy with multilingual datasets
Leveraging LLM reasoning with domain-specific adaptations
Miaomiao Gao
Aerospace Information Research Institute, Chinese Academy of Sciences
Xiaoxiao Xiang
LIGHTSPEED
Yiwen Guo
Research Scientist
Machine Learning · Deep Learning · Image Processing