🤖 AI Summary
South Africa’s official languages have long lacked structured, machine-readable terminology resources, which hinders multilingual natural language processing (NLP) development. Method: This paper introduces Marito, an open multilingual terminology resource that systematically aggregates, cleans, and standardizes fragmented term lists from government and academic institutions, many of them locked in non-machine-readable formats, into open, interoperable datasets released under the Africa-centered NOODL framework. To demonstrate its utility, the authors integrate the terminology into a retrieval-augmented generation (RAG) pipeline that supplies it to large language model (LLM)-based translation. Contribution/Results: Experiments show substantial improvements in accuracy and domain-specific consistency for English-to-Tshivenda translation. Marito provides a scalable foundation for robust, equitable NLP technologies that represent South Africa’s linguistic diversity.
📝 Abstract
The critical lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. Marito addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational Marito dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. Marito provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa's rich linguistic diversity is represented in the digital age.
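The abstract does not describe how the RAG pipeline retrieves or injects terminology, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a simple whole-word lookup of English terms in the source sentence, followed by insertion of the matched entries into a translation prompt. The names `TermEntry`, `retrieve_terms`, and `build_prompt` are hypothetical, and the target-language strings are placeholders rather than real Tshivenda terms.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TermEntry:
    source: str  # English term
    target: str  # target-language equivalent (placeholder values below)
    domain: str  # subject field of the term list


def retrieve_terms(sentence, glossary, max_terms=5):
    """Return glossary entries whose English term occurs in the sentence.

    Assumption: case-insensitive whole-word matching stands in for
    whatever retrieval method the real pipeline uses.
    """
    hits = []
    for entry in glossary:
        pattern = r"\b" + re.escape(entry.source) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            hits.append(entry)
    # Put longer, more specific terms first.
    hits.sort(key=lambda e: len(e.source), reverse=True)
    return hits[:max_terms]


def build_prompt(sentence, hits, target_language="Tshivenda"):
    """Assemble a translation prompt that lists the retrieved terminology."""
    lines = [f"Translate the following English sentence into {target_language}."]
    if hits:
        lines.append("Use these terminology entries where they apply:")
        lines.extend(f"- {e.source} -> {e.target} ({e.domain})" for e in hits)
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)


# Hypothetical glossary; the targets are placeholders, not real translations.
GLOSSARY = [
    TermEntry("interest rate", "<TERM_A>", "finance"),
    TermEntry("rate", "<TERM_B>", "finance"),
    TermEntry("ballot paper", "<TERM_C>", "elections"),
]

sentence = "The interest rate was raised this year."
hits = retrieve_terms(sentence, GLOSSARY)
prompt = build_prompt(sentence, hits)
print(prompt)
```

The resulting prompt would then be sent to whichever LLM performs the translation. Ordering longer terms first is one simple way to favour the more specific entry when terms overlap; a real system might instead use embedding-based retrieval or domain filtering.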