🤖 AI Summary
To address low automatic speech recognition (ASR) accuracy in multilingual dialogue scenarios, this paper proposes an Encoder-Adapter-LLM architecture. It introduces a lightweight adapter module that efficiently aligns a speech encoder with a multilingual large language model (LLM), facilitating cross-lingual knowledge transfer. Combined with domain adaptation and a multi-stage training strategy, the framework jointly optimizes speech representation learning and linguistic understanding. Trained on large-scale multilingual audio data, the model achieves competitive word error rates (WER) on both the development and test sets of Task 1 in the INTERSPEECH 2025 MLC-SLM Challenge, securing second place. Empirical results demonstrate its effectiveness and strong generalization capability for complex, code-switched, multilingual conversational ASR.
📝 Abstract
This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture. This framework harnesses the powerful reasoning capabilities of text-based large language models while incorporating domain-specific adaptations. To further enhance multilingual recognition performance, we adopted a meticulously designed multi-stage training strategy leveraging extensive multilingual audio datasets. Experimental results demonstrate that our approach achieves competitive word error rate (WER) performance on both the development and test sets, securing second place in the challenge ranking.
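The adapter's role in an encoder-adapter-LLM stack can be illustrated with a minimal sketch: it shortens the speech encoder's frame sequence and projects it into the LLM's embedding space so the frames can be consumed as pseudo-tokens. All dimensions, the frame-stacking factor, and the two-layer design below are illustrative assumptions, not the authors' actual configuration.

```python
import numpy as np

# Illustrative adapter between a speech encoder and an LLM.
# ENC_DIM, LLM_DIM, and STACK are assumed values for demonstration.
ENC_DIM = 1280   # assumed speech-encoder output dimension
LLM_DIM = 4096   # assumed LLM embedding dimension
STACK = 4        # assumed frame-stacking factor for sequence-length reduction

def adapter(frames: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Map encoder frames (T, ENC_DIM) to LLM pseudo-tokens (T // STACK, LLM_DIM).

    Concatenates every STACK consecutive frames, then applies a
    two-layer projection (linear + ReLU + linear) into the LLM space.
    """
    T = (frames.shape[0] // STACK) * STACK              # drop trailing frames
    stacked = frames[:T].reshape(-1, STACK * ENC_DIM)   # (T//STACK, STACK*ENC_DIM)
    hidden = np.maximum(stacked @ w1, 0.0)              # linear + ReLU
    return hidden @ w2                                  # project to LLM_DIM

# Randomly initialized weights stand in for the trained adapter.
rng = np.random.default_rng(0)
w1 = rng.standard_normal((STACK * ENC_DIM, 2048)) * 0.01
w2 = rng.standard_normal((2048, LLM_DIM)) * 0.01

encoder_out = rng.standard_normal((100, ENC_DIM))   # 100 frames of speech features
speech_tokens = adapter(encoder_out, w1, w2)
print(speech_tokens.shape)  # (25, 4096): 4x shorter sequence in LLM embedding space
```

In a full system, these projected embeddings would be concatenated with the text-prompt embeddings before the LLM forward pass; during the multi-stage training described above, typically only the adapter (and optionally the encoder or LLM, per stage) is updated.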