🤖 AI Summary
Current speech recognition and spoken language understanding (SLU) models exhibit limited performance on low-resource and unwritten languages, primarily due to the scarcity of high-quality multilingual speech–semantic annotations and the inadequacy of existing evaluation tasks (e.g., language identification, intent classification) for comprehensively assessing SLU capabilities. To address this, the authors introduce Fleurs-SLU, a large-scale multilingual SLU benchmark covering topical speech classification in 102 languages and listening-comprehension multiple-choice QA in 92 languages, enabling both end-to-end and cascaded SLU evaluation at the hundred-language scale. Evaluations show that cascaded systems (speech-to-text transcription followed by LLM classification) are more robust than end-to-end speech classification models, though appropriately pre-trained speech encoders can reach competitive performance on topical speech classification. The authors further observe a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefits between acoustic and semantic speech representations.
📝 Abstract
While recent multilingual automatic speech recognition models claim to support thousands of languages, ASR for low-resource languages remains highly unreliable due to limited bimodal speech and text training data. Better multilingual spoken language understanding (SLU) can massively strengthen the robustness of multilingual ASR by leveraging language semantics to compensate for scarce training data, such as disambiguating utterances via context or exploiting semantic similarities across languages. Even more so, SLU is indispensable for inclusive speech technology in the roughly half of all living languages that lack a formal writing system. However, the evaluation of multilingual SLU remains limited to shallow tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark that encompasses topical speech classification in 102 languages and multiple-choice question answering through listening comprehension in 92 languages. We extensively evaluate both end-to-end speech classification models and cascaded systems that combine speech-to-text transcription with subsequent classification by large language models on Fleurs-SLU. Our results show that cascaded systems exhibit greater robustness in multilingual SLU tasks, though speech encoders can achieve competitive performance in topical speech classification when appropriately pre-trained. We further find a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefits between acoustic and semantic speech representations.