The TEA-ASLP System for Multilingual Conversational Speech Recognition and Speech Diarization in MLC-SLM 2025 Challenge

πŸ“… 2025-07-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses two tasks of the MLC-SLM 2025 Challenge, multilingual conversational speech recognition (Task I) and speaker-attributed speech recognition (Task II), by proposing an enhanced Ideal-LLM model. Methodologically: (1) it integrates a language identification module with a multilingual Mixture-of-Experts (MoE) architecture, employing MoE-specific Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning; (2) it introduces a CTC-guided autoregressive generation mechanism, in which CTC-predicted tokens prompt the decoder, to strengthen cross-lingual modeling; and (3) it leverages a monolingual English speaker diarization model to improve segmentation robustness. The model is jointly optimized on 180,000 hours of multilingual ASR data using a unified CTC-autoregressive framework augmented with speaker disambiguation techniques. Experiments show a 9.60% word error rate (WER) on Task I, a 30.8% relative reduction over the baseline, and a time-constrained minimum-permutation WER of 17.49% for Task II. These results rank first and second, respectively, in the corresponding challenge tasks.
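The MoE-LoRA idea described above can be illustrated with a minimal sketch: a frozen pretrained projection is augmented with per-expert low-rank adapters, and a language-identification gate selects which expert's update is applied. All names, shapes, and the hard expert selection here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 8, 2, 4

W = rng.normal(size=(d, d))                    # frozen pretrained weight
A = rng.normal(size=(n_experts, r, d)) * 0.01  # per-expert LoRA down-projections
B = np.zeros((n_experts, d, r))                # per-expert LoRA up-projections (zero init)

def moe_lora_forward(x, lang_id, alpha=16.0):
    """Frozen base projection plus the language-selected expert's low-rank update.

    Only A[lang_id] and B[lang_id] would receive gradients during fine-tuning;
    W stays frozen, which is the parameter-efficiency of LoRA.
    """
    base = x @ W.T
    delta = (x @ A[lang_id].T) @ B[lang_id].T
    return base + (alpha / r) * delta

x = rng.normal(size=(3, d))
# With zero-initialized up-projections, each adapter starts as a no-op,
# so the model initially reproduces the frozen base exactly.
assert np.allclose(moe_lora_forward(x, lang_id=1), x @ W.T)
```

Routing on a known language ID (rather than a learned soft gate) mirrors the "known language identification" setting mentioned in the abstract: each language can be served by a dedicated low-rank expert without touching the shared backbone.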

πŸ“ Abstract
This paper presents the TEA-ASLP system submitted to the MLC-SLM 2025 Challenge, addressing multilingual conversational automatic speech recognition (ASR) in Task I and speech diarization ASR in Task II. For Task I, we enhance the Ideal-LLM model by integrating known language identification and a multilingual MoE LoRA structure, along with using CTC-predicted tokens as prompts to improve autoregressive generation. The model is trained on approximately 180k hours of multilingual ASR data. In Task II, we replace the baseline English-Chinese speaker diarization model with a more suitable English-only version. Our approach achieves a 30.8% reduction in word error rate (WER) compared to the baseline speech language model, resulting in a final WER of 9.60% in Task I and a time-constrained minimum-permutation WER of 17.49% in Task II, earning first and second place in the respective challenge tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multilingual conversational speech recognition accuracy
Improving speaker diarization for English conversational speech
Reducing word error rate in ASR tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates known language identification with a multilingual MoE LoRA structure
Uses CTC-predicted tokens as prompts for autoregressive generation
Replaces the English-Chinese diarization baseline with an English-only model
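The second innovation, using CTC-predicted tokens as prompts for the autoregressive decoder, can be sketched in a few lines: the CTC greedy path is collapsed (blanks and consecutive repeats removed) and the surviving token ids are prepended to the decoder prompt. The token ids, `BLANK`/`BOS` values, and `build_prompt` helper are illustrative assumptions, not the paper's API.

```python
BLANK, BOS = 0, 1  # assumed special-token ids for this sketch

def ctc_greedy_collapse(frame_ids, blank=BLANK):
    """Standard CTC collapse: drop blank frames and merge consecutive repeats."""
    out, prev = [], None
    for t in frame_ids:
        if t != blank and t != prev:
            out.append(t)
        prev = t
    return out

def build_prompt(ctc_tokens):
    """Prepend the CTC hypothesis so the AR decoder conditions on a draft transcript."""
    return [BOS] + ctc_tokens

# Per-frame greedy CTC output (0 = blank), e.g. from an encoder's CTC head.
frames = [0, 5, 5, 0, 3, 3, 3, 0, 0, 7]
hyp = ctc_greedy_collapse(frames)
assert hyp == [5, 3, 7]
assert build_prompt(hyp) == [1, 5, 3, 7]
```

The intuition is that the non-autoregressive CTC branch gives the decoder a cheap first-pass hypothesis, so autoregressive generation refines a draft rather than transcribing from scratch, which the summary credits with strengthening cross-lingual modeling.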
Hongfei Xue
Tencent Ethereal Audio Lab, Tencent Corporation, Beijing, China
Kaixun Huang
Northwestern Polytechnical University
Zhikai Zhou
Tencent Ethereal Audio Lab, Tencent Corporation, Beijing, China
Shen Huang
Shidong Shang
Tencent Ethereal Audio Lab, Tencent Corporation, Beijing, China