Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

📅 2025-06-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the critical decoder selection problem in constructing multilingual speech large language models (SpeechLLMs). We propose a progressive three-stage collaborative training paradigm, integrating a fine-tuned Whisper-large-v3 encoder with either Gemma-3-12B or Qwen2.5-7B decoders, and introduce lightweight learnable linear/MLP projection modules to bridge the modality gap. To our knowledge, this is the first systematic empirical comparison of these two leading open-source LLMs in multilingual SpeechLLM settings. On a private test set, Gemma-3-12B achieves an average WER/CER of 16.63%, significantly outperforming Qwen2.5-7B's 18.6%. Our approach ranks first in the MLC-SLM Challenge 2025, validating both architectural design and training strategy. The results establish an empirical benchmark and practical technical pathway for decoder selection and optimization in multilingual SpeechLLMs.

πŸ“ Abstract
This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance, with private-test average WER/CER results of 16.63% using Gemma-3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.
Problem

Research questions and friction points this paper is trying to address.

Compares Qwen and Gemma integration with Whisper for multilingual speech recognition
Explores efficient projector architectures and decoder configurations in SpeechLLM systems
Evaluates performance using WER/CER metrics with different decoder-only language models
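The WER/CER metrics named in the last bullet are edit-distance rates between the reference and the hypothesis, computed over words (WER) or characters (CER, typical for languages without whitespace word boundaries). A minimal word-level illustration, not the challenge's official scoring script:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over word sequences,
    normalized by the reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits to turn first i ref words into first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat down"))  # one insertion / 3 words
```

CER is the same computation with `list(ref)` and `list(hyp)` in place of the word splits.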
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned Whisper-large-v3 encoder
Efficient projector architectures
Three-stage training methodology
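The three-stage methodology progressively optimizes encoder, projector, and LLM. Exactly which components are frozen at each stage is not spelled out in this summary, so the schedule below is an assumed illustration of how such staged training is typically wired up:

```python
# Hypothetical three-stage schedule (the per-stage trainable sets are
# assumptions based on the summary, not taken from the paper).
STAGES = [
    {"name": "stage1_encoder_adapt", "train": {"encoder", "projector"}},
    {"name": "stage2_projector",     "train": {"projector"}},
    {"name": "stage3_llm_finetune",  "train": {"projector", "llm"}},
]

COMPONENTS = ("encoder", "projector", "llm")

def trainable_flags(stage):
    """Map each component to whether its parameters get gradients."""
    return {c: c in stage["train"] for c in COMPONENTS}

for stage in STAGES:
    print(stage["name"], trainable_flags(stage))
```

In a real training loop these flags would drive `requires_grad` toggles and per-stage optimizer construction, so each stage fine-tunes only its target components while the rest stay frozen.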
Tuan Nguyen
Institute for Infocomm Research (I2R), A*STAR, Singapore
Long-Vu Hoang
Hanoi University of Science and Technology
Speaker Recognition · Speech Recognition · Large Language Models
Huy-Dat Tran
Institute for Infocomm Research (I2R), A*STAR, Singapore