The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages

📅 2025-02-21
🤖 AI Summary
To address data scarcity and inequitable benefit distribution in automatic speech recognition (ASR) for low-resource languages, particularly in Africa, this work proposes a community-centered, sustainable data governance framework. It introduces the novel Esethu Community License, which is intended to ensure linguistic data sovereignty and equitable benefit-sharing for native speakers; constructs ViXSD, the first open-source, native-speaker-led isiXhosa speech corpus annotated with fine-grained sociolinguistic metadata; and integrates participatory data collection, structured metadata modeling, and ASR validation using Kaldi and Whisper. Results demonstrate that ViXSD substantially improves isiXhosa ASR performance and establishes a reproducible paradigm in which ethical and technical design are developed together for low-resource language AI. The framework contributes community-embedded data governance as a method and offers a transferable model for ethical, inclusive, and sustainable AI development in under-resourced linguistic contexts.

📝 Abstract
This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resources. This framework is supported by the Esethu license, a novel community-centric data license. As a proof of concept, we introduce the Vuk'uzenzele isiXhosa Speech Dataset (ViXSD), an open-source corpus developed under the Esethu Framework and License. The dataset, containing read speech from native isiXhosa speakers enriched with demographic and linguistic metadata, demonstrates how community-driven licensing and curation principles can bridge resource gaps in automatic speech recognition (ASR) for African languages while safeguarding the interests of data creators. We describe the framework guiding dataset development, outline the Esethu license provisions, describe the methodology for ViXSD, and report ASR experiments validating ViXSD's usability in building and refining voice-driven applications for isiXhosa.
Problem

Research questions and friction points this paper is trying to address.

Sustainable dataset governance
Low-resource language empowerment
Community-centric data licensing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Community-centric data license
Linguistic metadata enrichment
Automatic speech recognition validation
Authors

Jenalea Rajab
RAIL Lab – University of the Witwatersrand, Lelapa AI
Anuoluwapo Aremu
Lelapa AI
E. Chimoto
Lelapa AI
Dale Dunbar
Way With Words
Graham Morrissey
Way With Words
Fadel Thior
Lelapa AI
Luandrie Potgieter
Lelapa AI
Jessica Ojo
Lelapa AI
A. Tonja
Lelapa AI, MBZUAI
Maushami Chetty
Aarya Legal
Onyothi Nekoto
Masakhane
Pelonomi Moiloa
Lelapa AI
Jade Abbott
Masakhane, Lelapa AI
Natural Language Processing, Artificial Intelligence, Computational Intelligence
V. Marivate
DSFSI – University of Pretoria, Lelapa AI
Benjamin Rosman
Professor at the University of the Witwatersrand, South Africa
Robotics, Artificial Intelligence, Machine Learning, Decision Making, Reinforcement Learning