🤖 AI Summary
This work addresses speech-to-speech translation for low-resource Nigerian languages (Igbo, Hausa, Yorùbá, and Nigerian Pidgin), where progress has been hindered by the scarcity of high-quality, multi-accent parallel speech data. The authors present NaijaS2ST, the first large-scale, real-world speech-to-speech translation dataset for these languages paired with English, and systematically evaluate cascaded, end-to-end, and AudioLLM-based approaches on bidirectional translation tasks. Experimental results show that few-shot AudioLLMs outperform fine-tuned cascaded and end-to-end models in speech-to-text translation. In speech-to-speech translation, however, cascaded systems perform on par with AudioLLMs, highlighting the need for dedicated, task-specific models for this setting.
📝 Abstract
Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translation dataset spanning Igbo, Hausa, Yorùbá, and Nigerian Pidgin paired with English. The dataset comprises approximately 50 hours of speech per language and captures substantial variation in speakers and accents, reflecting realistic multilingual and multi-accent conditions. With NaijaS2ST, we conduct a comprehensive benchmark of cascaded, end-to-end (E2E), and AudioLLM-based approaches across bidirectional translation settings. Our results show that AudioLLMs prompted with few-shot examples are more effective for speech-to-text translation than fine-tuned cascaded and end-to-end methods. However, for speech-to-speech translation, the cascaded and AudioLLM paradigms yield comparable performance, indicating that there is still considerable room for improvement in developing targeted, task-specific models for this setting. By providing both a high-quality dataset and a systematic benchmark, we hope that NaijaS2ST will serve as a strong foundation for advancing research in low-resource, multilingual speech translation.