Speech Language Models for Under-Represented Languages: Insights from Wolof

📅 2025-09-18

🤖 AI Summary

This study addresses speech modeling for low-resource languages, using Wolof as a case study, and presents the first Speech LLM for the language. We continue pretraining HuBERT on a large-scale, high-quality dataset of spontaneous Wolof speech, then integrate the resulting speech encoder into a Wolof large language model (LLM), so that one model handles automatic speech recognition (ASR) and speech translation (ST), optionally after multi-step Chain-of-Thought reasoning. The main takeaways are threefold: (1) large-scale, spontaneous, high-quality data combined with continued pretraining matters, with the adapted HuBERT outperforming both the base model and African-centric models on ASR; (2) the Speech LLM further improves speech recognition and also performs well in speech translation; and (3) the models and code will be openly shared, giving other low-resource languages a reproducible path to follow.

📝 Abstract

We present our journey in training a speech language model for Wolof, an underrepresented language spoken in West Africa, and share key insights. We first emphasize the importance of collecting large-scale, spontaneous, high-quality speech data, and show that continued pretraining of HuBERT on this dataset outperforms both the base model and African-centric models on ASR. We then integrate this speech encoder into a Wolof LLM to train the first Speech LLM for this language, extending its capabilities to tasks such as speech translation. Furthermore, we explore training the Speech LLM to perform multi-step Chain-of-Thought reasoning before transcribing or translating. Our results show that the Speech LLM not only improves speech recognition but also performs well in speech translation. The models and the code will be openly shared.
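The abstract does not say how the multi-step Chain-of-Thought targets are laid out. As a minimal sketch, assuming each reasoning step is written as a tagged span in a single target string, the serialization and parsing could look like the code below. The tag names (`transcript`, `translation`) and both helper functions are hypothetical and chosen only for illustration.

```python
import re
from typing import List, Optional, Tuple


def build_cot_target(steps: List[Tuple[str, str]]) -> str:
    """Serialize ordered (tag, text) steps into one training target string.

    The tagged layout is an assumption, not the format used in the paper.
    """
    return "\n".join(f"<{tag}>{text.strip()}</{tag}>" for tag, text in steps)


def extract_step(output: str, tag: str) -> Optional[str]:
    """Return the text of the first span with the given tag, or None if absent."""
    escaped = re.escape(tag)
    match = re.search(rf"<{escaped}>(.*?)</{escaped}>", output, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical two-step target: transcribe first, then translate.
target = build_cot_target([
    ("transcript", " source words "),
    ("translation", "target words"),
])
final_answer = extract_step(target, "translation")
```

Keeping the intermediate steps in the same string as the final answer means that only the last tagged span needs to be scored at evaluation time, while the earlier spans stay available for inspection.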

Problem

Research questions and friction points this paper is trying to address.

Training a speech language model for Wolof, an underrepresented language

Improving automatic speech recognition with large-scale, spontaneous Wolof speech data

Adding speech translation capabilities by integrating the speech encoder into a Wolof LLM
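ASR improvements of this kind are commonly measured with word error rate (WER); the summary above does not name the metric, so treating WER as the measure here is an assumption. A minimal, dependency-free sketch of WER as word-level edit distance divided by reference length:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution (b -> x) and one deletion (d).
score = word_error_rate("a b c d", "a x c")
```

Lower is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.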

Innovation

Methods, ideas, or system contributions that make the work stand out.

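Continued pretraining of HuBERT on large-scale, spontaneous Wolof speech data

Integration of the resulting speech encoder into a Wolof LLM

Training of the Speech LLM to perform multi-step Chain-of-Thought reasoning before transcribing or translating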
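The summary does not describe how the speech encoder is connected to the LLM. One common pattern, shown here purely as an assumption, is to shorten the encoder's frame sequence by stacking neighbouring frames and then map each stacked frame to the LLM embedding size with a linear layer. The function names, the stacking factor `k`, and the toy weights below are all hypothetical.

```python
from typing import List

Frame = List[float]


def stack_frames(frames: List[Frame], k: int) -> List[Frame]:
    """Concatenate every k consecutive frames; an incomplete tail is dropped."""
    usable = len(frames) - len(frames) % k
    return [
        [value for frame in frames[i:i + k] for value in frame]
        for i in range(0, usable, k)
    ]


def project(frames: List[Frame], weight: List[Frame], bias: Frame) -> List[Frame]:
    """Apply y = W x + b to each frame; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


# Five toy encoder frames of size 2, stacked in pairs (the fifth frame is dropped).
encoder_frames = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
stacked = stack_frames(encoder_frames, k=2)

# Toy projection from size 4 to a hypothetical LLM embedding size of 2.
weight = [[1, 0, 0, 0], [0, 0, 0, 1]]
bias = [0.5, -1.0]
llm_inputs = project(stacked, weight, bias)
```

In a real system the projection weights would be learned, and the projected frames would be fed to the LLM alongside its text token embeddings.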
