🤖 AI Summary
This study addresses the lack of effective evaluation benchmarks for automatic speech recognition (ASR) systems in scenarios involving multilingual code-switching, dense scientific terminology, and conversational complexity. To bridge this gap, the authors introduce the first ASR evaluation dataset derived from authentic multilingual scientific discussions, comprising recordings of multiple speakers conversing bilingually about research papers, accompanied by audio segmentation, speaker diarization, and multilingual transcripts. The work proposes a comprehensive evaluation framework that extends beyond conventional word error rate metrics to enable consistent cross-lingual performance comparison. Experimental results demonstrate a significant performance drop among state-of-the-art ASR systems on this benchmark, underscoring both its challenge and practical relevance for advancing multilingual ASR in specialized domains.
📝 Abstract
The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating an experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: handling mixed multilingual input, domain-specific vocabulary, and code-switching. However, there is currently no dataset that benchmarks this situation. We propose a new benchmark to evaluate whether current Automatic Speech Recognition (ASR) systems are able to handle these challenges. The benchmark consists of bilingual discussions on scientific papers between multiple speakers, each conversing in a different language. We provide a standard evaluation framework that goes beyond Word Error Rate (WER), enabling consistent comparison of ASR performance across languages. Experimental results demonstrate that the proposed dataset remains an open challenge for state-of-the-art ASR systems. The dataset is available at https://huggingface.co/datasets/goodpiku/muscat-eval
Keywords: multilingual, speech recognition, audio segmentation, speaker diarization