mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM evaluations are heavily biased toward English and other high-resource languages, and multilingual, multimodal benchmarks are largely missing, especially for low-resource languages. To address this gap, we introduce mSTEB, the first large-scale multilingual speech-text evaluation benchmark, covering 100+ languages, including many low-resource African, American, and Oceanic languages, with tasks spanning language identification, text classification, question answering, and translation in both speech and text modalities. Our contributions are threefold: (1) the first systematic, unified evaluation of cross-lingual capabilities in both speech- and text-based LLMs; (2) a speech-text aligned dataset with standardized tasks and a unified zero-shot/few-shot evaluation protocol; and (3) an open-source, reproducible evaluation framework. Experiments on Gemini 2.0 Flash, GPT-4o (Audio), Qwen2-Audio, and Gemma3-27B reveal a 42.7% average performance drop on low-resource languages relative to English, underscoring data-induced bias and the need for fairness-aware evaluation.
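
The 42.7% figure above is an aggregate gap relative to English. As a rough illustration only (the paper's exact aggregation is not given here, and every score below is a made-up placeholder), such a figure could be computed as the mean relative drop across tasks:

```python
from statistics import mean

# task -> {language code: score}; all numbers are illustrative placeholders,
# not results from the paper.
scores = {
    "language_id": {"eng": 0.97, "hau": 0.55, "yor": 0.50},
    "text_classification": {"eng": 0.91, "hau": 0.62, "yor": 0.50},
}

def avg_drop_vs_english(scores, low_resource_langs):
    """Mean relative drop of low-resource scores against English, averaged over tasks."""
    drops = []
    for per_lang in scores.values():
        eng = per_lang["eng"]
        low = mean(per_lang[lang] for lang in low_resource_langs if lang in per_lang)
        drops.append((eng - low) / eng)
    return mean(drops)

print(f"Average drop vs. English: {avg_drop_vs_english(scores, ['hau', 'yor']):.1%}")
```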

📝 Abstract
Large Language Models (LLMs) have demonstrated impressive performance on a wide range of tasks, including in multimodal settings such as speech. However, their evaluation is often limited to English and a few high-resource languages. For low-resource languages, there is no standardized evaluation benchmark. In this paper, we address this gap by introducing mSTEB, a new benchmark to evaluate the performance of LLMs on a wide range of tasks covering language identification, text classification, question answering, and translation, on both speech and text modalities. We evaluate the performance of leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in performance between high-resource and low-resource languages, especially for languages spoken in Africa and the Americas/Oceania. Our findings show that more investment is needed to address their under-representation in LLM coverage.
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized LLM evaluation for low-resource languages
Need multilingual benchmark for speech/text tasks across diverse languages
Performance gap between high-resource and underrepresented languages persists
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces mSTEB for multilingual LLM evaluation
Evaluates LLMs on diverse speech and text tasks (see the evaluation-loop sketch below)
Highlights performance gap in low-resource languages
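
To make the evaluation setup concrete, here is a minimal sketch of a zero-shot evaluation loop in the spirit of the protocol described above. The function name, prompt template, and data layout are hypothetical stand-ins, not the paper's actual framework or API.

```python
from typing import Callable

def zero_shot_accuracy(model_fn: Callable[[str], str],
                       examples: list[dict],
                       prompt_template: str) -> float:
    """Prompt the model once per example (no in-context demonstrations)
    and score exact-match accuracy against the gold label."""
    correct = 0
    for ex in examples:
        prompt = prompt_template.format(text=ex["input"])
        prediction = model_fn(prompt).strip().lower()
        correct += prediction == ex["label"].lower()
    return correct / len(examples)

# Usage idea: run the same template for every language and compare per-language scores.
# for lang, examples in benchmark["language_id"].items():   # hypothetical data layout
#     score = zero_shot_accuracy(call_model, examples,
#                                "Identify the language of this text: {text}\nAnswer:")
```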