Marito: Structuring and Building Open Multilingual Terminologies for South African NLP

📅 2025-08-05
🤖 AI Summary
Problem: South Africa's official languages have long lacked structured, machine-readable terminology resources, which hinders multilingual natural language processing (NLP) development. Method: This paper introduces Marito, an open multilingual terminology resource that aggregates, cleans, and standardizes fragmented, unstructured term lists from government and academic institutions into open, interoperable datasets, released under the Africa-centered NOODL framework. The terminology is integrated into a retrieval-augmented generation (RAG) pipeline that supplies it to large language model (LLM)-based translation. Contribution/Results: Experiments show substantial improvements in accuracy and domain-specific consistency for English-to-Tshivenda translation. The resource provides a scalable foundation for equitable, locally grounded NLP technologies for South African languages.

📝 Abstract
The critical lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. Marito addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational Marito dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. Marito provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa's rich linguistic diversity is represented in the digital age.
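The aggregation, cleaning, and standardisation step described in the abstract can be pictured with a minimal sketch. It is not the Marito pipeline: the `TermEntry` schema, the `clean` and `standardise` helpers, and the placeholder target strings are all assumptions made for illustration, and the language codes are ISO 639-1 (`en` for English, `ve` for Tshivenda).

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class TermEntry:
    """Hypothetical unified record for one bilingual term pair."""
    domain: str
    source_lang: str
    target_lang: str
    source_term: str
    target_term: str


def clean(text: str) -> str:
    """Apply Unicode NFC normalisation and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def standardise(rows, domain, source_lang, target_lang):
    """Turn raw (source, target) pairs into cleaned, de-duplicated entries.

    Rows with an empty side are skipped; duplicates are detected
    case-insensitively and the first occurrence is kept.
    """
    seen = set()
    entries = []
    for src, tgt in rows:
        src, tgt = clean(src), clean(tgt)
        if not src or not tgt:
            continue
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue
        seen.add(key)
        entries.append(TermEntry(domain, source_lang, target_lang, src, tgt))
    return entries


# Placeholder rows; the target strings are NOT real Tshivenda terms.
raw_rows = [
    ("  interest   rate ", "TARGET-1"),
    ("Interest rate", "target-1"),   # case-insensitive duplicate
    ("inflation", ""),               # missing target, skipped
    ("budget", "TARGET-2"),
]
entries = standardise(raw_rows, domain="finance", source_lang="en", target_lang="ve")
```

Keeping the domain and language codes on every record is one way to make lists from different institutions mergeable into a single dataset.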
Problem

Research questions and friction points this paper is trying to address.

Lack of structured multilingual terminologies for South African NLP
Fragmented terminological data in non-machine-readable formats
Need for scalable foundation for equitable NLP technologies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aggregating and standardizing multilingual terminological data
Releasing datasets under Africa-centered NOODL framework
Integrating terminology into Retrieval-Augmented Generation pipeline
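A rough sketch of how terminology could be injected into an LLM translation prompt follows. The page does not describe the paper's retriever or prompt format, so this swaps in simple whole-word lexical matching; `retrieve_terms`, `build_prompt`, the prompt wording, and the placeholder glossary values are all hypothetical.

```python
import re


def retrieve_terms(sentence, glossary, max_terms=5):
    """Return (source, target) pairs whose source term occurs in the sentence.

    Matching is case-insensitive and on whole words; longer terms come first
    so multi-word entries take precedence over their sub-terms.
    """
    hits = []
    for src, tgt in glossary.items():
        pattern = r"\b" + re.escape(src) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            hits.append((src, tgt))
    hits.sort(key=lambda pair: len(pair[0]), reverse=True)
    return hits[:max_terms]


def build_prompt(sentence, glossary, target_language="Tshivenda"):
    """Assemble a translation prompt that lists the retrieved glossary entries."""
    terms = retrieve_terms(sentence, glossary)
    lines = [f"Translate the following English sentence into {target_language}."]
    if terms:
        lines.append("Use these glossary translations where the terms occur:")
        lines.extend(f"- {src} => {tgt}" for src, tgt in terms)
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)


# Placeholder glossary; the values are NOT real Tshivenda terms.
glossary = {
    "interest rate": "<ve: interest rate>",
    "rate": "<ve: rate>",
    "budget": "<ve: budget>",
}
prompt = build_prompt("The interest rate rose last year.", glossary)
print(prompt)
```

The resulting string would be passed to whichever LLM performs the translation. A production retriever would more likely use lemmatisation or embedding search, since exact matching misses inflected forms.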
Vukosi Marivate
University of Pretoria, Lelapa AI, Deep Learning Indaba, Masakhane Research Foundation
Data Science, Natural Language Processing, Machine Learning, Artificial Intelligence, Reinforcement
Isheanesu Dzingirai
DSFSI, Dept. of Computer Science, University of Pretoria
Fiskani Banda
DSFSI, Dept. of Computer Science, University of Pretoria
Richard Lastrucci
DSFSI, Dept. of Computer Science, University of Pretoria
Thapelo Sindane
DSFSI, Dept. of Computer Science, University of Pretoria
Keabetswe Madumo
DSFSI, Dept. of Computer Science, University of Pretoria
Kayode Olaleye
Unknown affiliation
speech and language processing
Abiodun Modupe
DSFSI, Dept. of Computer Science, University of Pretoria
Unarine Netshifhefhe
DSFSI, Dept. of Computer Science, University of Pretoria
Herkulaas Combrink
Economics and Management Sciences, University of the Free State; Interdisciplinary Centre for Digital Futures, University of the Free State
Mohlatlego Nakeng
DSFSI, Dept. of Computer Science, University of Pretoria
Matome Ledwaba
DSFSI, Dept. of Computer Science, University of Pretoria