AI Summary
This study addresses the critical decoder selection problem in constructing multilingual speech large language models (SpeechLLMs). We propose a progressive three-stage collaborative training paradigm that integrates a fine-tuned Whisper-large-v3 encoder with either a Gemma-3-12B or a Qwen2.5-7B decoder, and introduce lightweight learnable linear/MLP projection modules to bridge the modality gap. To our knowledge, this is the first systematic empirical comparison of these two leading open-source LLMs in multilingual SpeechLLM settings. On a private test set, Gemma-3-12B achieves an average WER/CER of 16.63%, significantly outperforming Qwen2.5-7B's 18.6%. Our approach ranks first in the MLC-SLM Challenge 2025, validating both the architectural design and the training strategy. The results establish an empirical benchmark and a practical technical pathway for decoder selection and optimization in multilingual SpeechLLMs.
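The projection module described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the hidden width (2048) is an assumption, and the input/output dimensions (1280 for the Whisper-large-v3 encoder, 3584 for Qwen2.5-7B's embedding space) are standard published model sizes but should be checked against the actual configuration.

```python
import numpy as np

class MLPProjector:
    """Hypothetical sketch of the lightweight learnable projector that
    maps speech-encoder frames into the LLM embedding space."""

    def __init__(self, in_dim=1280, hidden_dim=2048, out_dim=3584, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init stands in for learned weights.
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, out_dim)) * 0.02
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        # x: (num_frames, in_dim) encoder output -> (num_frames, out_dim)
        h = np.maximum(0.0, x @ self.w1 + self.b1)  # ReLU hidden layer
        return h @ self.w2 + self.b2

proj = MLPProjector()
frames = np.zeros((10, 1280))          # 10 dummy encoder frames
llm_inputs = proj(frames)              # shape: (10, 3584)
```

A single linear layer (dropping `w1`/`b1` and the ReLU) is the simpler variant the summary also mentions; the MLP form adds one nonlinearity at modest parameter cost.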
Abstract
This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance, with an average WER/CER of 16.63% on the private test set using Gemma-3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.
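One way to read the three-stage methodology is as a schedule of which components are trainable at each stage. The sketch below is an assumed interpretation, not the authors' recipe: the exact per-stage freeze/unfreeze choices and stage names are hypothetical.

```python
# Hypothetical three-stage schedule: each stage unfreezes more of the
# encoder -> projector -> LLM stack. The specific assignments are an
# assumption for illustration only.
STAGES = [
    {"name": "stage1", "trainable": {"projector"}},
    {"name": "stage2", "trainable": {"encoder", "projector"}},
    {"name": "stage3", "trainable": {"encoder", "projector", "llm"}},
]

def trainable_modules(stage_index):
    """Return the set of module names updated during the given stage."""
    return STAGES[stage_index]["trainable"]

def frozen_modules(stage_index):
    """Complement of the trainable set over the full component list."""
    all_modules = {"encoder", "projector", "llm"}
    return all_modules - trainable_modules(stage_index)
```

In a real training loop, `frozen_modules` would drive `requires_grad` flags (or the optimizer's parameter groups) before each stage begins.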