The TEA-ASLP System for Multilingual Conversational Speech Recognition and Speech Diarization in MLC-SLM 2025 Challenge

πŸ“… 2025-07-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses two tasks of the MLC-SLM 2025 Challenge, multilingual conversational speech recognition (Task I) and speaker-attributed speech recognition (Task II), by proposing an enhanced Ideal-LLM model. Methodologically: (1) it integrates a language identification module with a multilingual Mixture-of-Experts (MoE) architecture, employing MoE-specific Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning; (2) it introduces a CTC-guided autoregressive generation mechanism, in which CTC-predicted tokens prompt the decoder, to strengthen cross-lingual modeling; and (3) it leverages a monolingual English speaker diarization model to improve segmentation robustness. The model is jointly optimized on 180,000 hours of multilingual ASR data using a unified CTC-autoregressive framework augmented with speaker disambiguation techniques. Experiments show a 9.60% word error rate (WER) on Task I, a 30.8% relative reduction over the baseline, and a time-constrained minimum-permutation WER of 17.49% for Task II. These results rank first and second, respectively, in the corresponding challenge tasks.
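The MoE-LoRA idea described above can be illustrated with a minimal sketch: a frozen pretrained projection is augmented with per-expert low-rank adapters, and a language-identification gate selects which expert's update is applied. All names, shapes, and the hard expert selection here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 8, 2, 4

W = rng.normal(size=(d, d))                    # frozen pretrained weight
A = rng.normal(size=(n_experts, r, d)) * 0.01  # per-expert LoRA down-projections
B = np.zeros((n_experts, d, r))                # per-expert LoRA up-projections (zero init)

def moe_lora_forward(x, lang_id, alpha=16.0):
    """Frozen base projection plus the language-selected expert's low-rank update.

    Only A[lang_id] and B[lang_id] would receive gradients during fine-tuning;
    W stays frozen, which is the parameter-efficiency of LoRA.
    """
    base = x @ W.T
    delta = (x @ A[lang_id].T) @ B[lang_id].T
    return base + (alpha / r) * delta

x = rng.normal(size=(3, d))
# With zero-initialized up-projections, each adapter starts as a no-op,
# so the model initially reproduces the frozen base exactly.
assert np.allclose(moe_lora_forward(x, lang_id=1), x @ W.T)
```

Routing on a known language ID (rather than a learned soft gate) mirrors the "known language identification" setting mentioned in the abstract: each language can be served by a dedicated low-rank expert without touching the shared backbone.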

πŸ“ Abstract
This paper presents the TEA-ASLP system submitted to the MLC-SLM 2025 Challenge, addressing multilingual conversational automatic speech recognition (ASR) in Task I and speech diarization ASR in Task II. For Task I, we enhance the Ideal-LLM model by integrating known language identification and a multilingual MoE LoRA structure, along with using CTC-predicted tokens as prompts to improve autoregressive generation. The model is trained on approximately 180k hours of multilingual ASR data. In Task II, we replace the baseline English-Chinese speaker diarization model with a more suitable English-only version. Our approach achieves a 30.8% reduction in word error rate (WER) compared to the baseline speech language model, resulting in a final WER of 9.60% in Task I and a time-constrained minimum-permutation WER of 17.49% in Task II, earning first and second place in the respective challenge tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multilingual conversational speech recognition accuracy
Improving speaker diarization for English conversational speech
Reducing word error rate in ASR tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates known language identification with a multilingual MoE LoRA structure
Uses CTC-predicted tokens as prompts for autoregressive generation
Replaces the English-Chinese diarization baseline with an English-only model
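The second innovation, using CTC-predicted tokens as prompts for the autoregressive decoder, can be sketched in a few lines: the CTC greedy path is collapsed (blanks and consecutive repeats removed) and the surviving token ids are prepended to the decoder prompt. The token ids, `BLANK`/`BOS` values, and `build_prompt` helper are illustrative assumptions, not the paper's API.

```python
BLANK, BOS = 0, 1  # assumed special-token ids for this sketch

def ctc_greedy_collapse(frame_ids, blank=BLANK):
    """Standard CTC collapse: drop blank frames and merge consecutive repeats."""
    out, prev = [], None
    for t in frame_ids:
        if t != blank and t != prev:
            out.append(t)
        prev = t
    return out

def build_prompt(ctc_tokens):
    """Prepend the CTC hypothesis so the AR decoder conditions on a draft transcript."""
    return [BOS] + ctc_tokens

# Per-frame greedy CTC output (0 = blank), e.g. from an encoder's CTC head.
frames = [0, 5, 5, 0, 3, 3, 3, 0, 0, 7]
hyp = ctc_greedy_collapse(frames)
assert hyp == [5, 3, 7]
assert build_prompt(hyp) == [1, 5, 3, 7]
```

The intuition is that the non-autoregressive CTC branch gives the decoder a cheap first-pass hypothesis, so autoregressive generation refines a draft rather than transcribing from scratch, which the summary credits with strengthening cross-lingual modeling.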
Hongfei Xue
Tencent Ethereal Audio Lab, Tencent Corporation, Beijing, China
Kaixun Huang
Northwestern Polytechnical University
Zhikai Zhou
Tencent Ethereal Audio Lab, Tencent Corporation, Beijing, China
Shen Huang
Shidong Shang
Tencent Ethereal Audio Lab, Tencent Corporation, Beijing, China