Leveraging Language Information for Target Language Extraction

📅 2025-11-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Conventional speech separation methods for multilingual mixed speech suffer from limited performance due to the absence of explicit language priors. Method: This paper proposes the first end-to-end language-aware speech separation framework, which integrates pretrained speech models to extract language representations and jointly optimizes them with a time-domain separation network to guide target-speech waveform reconstruction. Contribution/Results: We further introduce ML-TLE—the first publicly available multilingual target-language extraction dataset. Experiments on English–German mixed speech demonstrate that our method improves SI-SNR by 1.22 dB and 1.12 dB over baseline systems, respectively, significantly outperforming conventional language-agnostic separation approaches. These results empirically validate both the effectiveness and necessity of explicitly incorporating linguistic knowledge into speech separation.

Technology Category

Application Category

📝 Abstract
Target Language Extraction aims to extract speech in a specific language from a mixture waveform that contains multiple speakers speaking different languages. The human auditory system is adept at performing this task with the knowledge of the particular language. However, the performance of the conventional extraction systems is limited by the lack of this prior knowledge. Speech pre-trained models, which capture rich linguistic and phonetic representations from large-scale in-the-wild corpora, can provide this missing language knowledge to these systems. In this work, we propose a novel end-to-end framework to leverage language knowledge from speech pre-trained models. This knowledge is used to guide the extraction model to better capture the target language characteristics, thereby improving extraction quality. To demonstrate the effectiveness of our proposed approach, we construct the first publicly available multilingual dataset for Target Language Extraction. Experimental results show that our method achieves improvements of 1.22 dB and 1.12 dB in SI-SNR for English and German extraction, respectively, from mixtures containing both languages.
Problem

Research questions and friction points this paper is trying to address.

Extracting specific language speech from multilingual audio mixtures
Improving extraction quality using linguistic knowledge from pre-trained models
Addressing limited performance in conventional language extraction systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging pre-trained models for language knowledge
Using language knowledge to guide extraction model
Creating first multilingual dataset for language extraction
🔎 Similar Papers
No similar papers found.
M
Mehmet Sinan Yıldırım
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
R
Ruijie Tao
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
W
Wupeng Wang
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
Junyi Ao
Junyi Ao
The Chinese University of Hong Kong, Shenzhen
Speech RecognitionSelf-Supervised Learning
Haizhou Li
Haizhou Li
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China; NUS, Singapore
Automatic Speech RecognitionSpeaker RecognitionLanguage RecognitionVoice ConversionMachine Translation