AI Summary
This study addresses the critical decoder selection problem in constructing multilingual speech large language models (SpeechLLMs). We propose a progressive three-stage collaborative training paradigm that integrates a fine-tuned Whisper-large-v3 encoder with either a Gemma-3-12B or a Qwen2.5-7B decoder, and introduce lightweight learnable linear/MLP projection modules to bridge the modality gap. To our knowledge, this is the first systematic empirical comparison of these two leading open-source LLMs in multilingual SpeechLLM settings. On a private test set, Gemma-3-12B achieves an average WER/CER of 16.63%, significantly outperforming Qwen2.5-7B's 18.6%. Our approach ranks first in the MLC-SLM Challenge 2025, validating both the architectural design and the training strategy. The results establish an empirical benchmark and a practical technical pathway for decoder selection and optimization in multilingual SpeechLLMs.
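The projection module described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the hidden width (2048) is an assumption, and the input/output dimensions (1280 for the Whisper-large-v3 encoder, 3584 for Qwen2.5-7B's embedding space) are standard published model sizes but should be checked against the actual configuration.

```python
import numpy as np

class MLPProjector:
    """Hypothetical sketch of the lightweight learnable projector that
    maps speech-encoder frames into the LLM embedding space."""

    def __init__(self, in_dim=1280, hidden_dim=2048, out_dim=3584, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init stands in for learned weights.
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, out_dim)) * 0.02
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        # x: (num_frames, in_dim) encoder output -> (num_frames, out_dim)
        h = np.maximum(0.0, x @ self.w1 + self.b1)  # ReLU hidden layer
        return h @ self.w2 + self.b2

proj = MLPProjector()
frames = np.zeros((10, 1280))          # 10 dummy encoder frames
llm_inputs = proj(frames)              # shape: (10, 3584)
```

A single linear layer (dropping `w1`/`b1` and the ReLU) is the simpler variant the summary also mentions; the MLP form adds one nonlinearity at modest parameter cost.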
Abstract
This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance, with an average WER/CER of 16.63% on the private test set using Gemma-3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.
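One way to read the three-stage methodology is as a schedule of which components are trainable at each stage. The sketch below is an assumed interpretation, not the authors' recipe: the exact per-stage freeze/unfreeze choices and stage names are hypothetical.

```python
# Hypothetical three-stage schedule: each stage unfreezes more of the
# encoder -> projector -> LLM stack. The specific assignments are an
# assumption for illustration only.
STAGES = [
    {"name": "stage1", "trainable": {"projector"}},
    {"name": "stage2", "trainable": {"encoder", "projector"}},
    {"name": "stage3", "trainable": {"encoder", "projector", "llm"}},
]

def trainable_modules(stage_index):
    """Return the set of module names updated during the given stage."""
    return STAGES[stage_index]["trainable"]

def frozen_modules(stage_index):
    """Complement of the trainable set over the full component list."""
    all_modules = {"encoder", "projector", "llm"}
    return all_modules - trainable_modules(stage_index)
```

In a real training loop, `frozen_modules` would drive `requires_grad` flags (or the optimizer's parameter groups) before each stage begins.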