Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data

📅 2025-09-16
🤖 AI Summary
Maltese is a Semitic language written in the Latin script, and it remains a low-resource language for NLP. Its orthography diverges sharply from that of Arabic, its closest relative, despite their shared ancestry, which makes Arabic resources hard to reuse directly. To address this, we propose a cross-lingual data augmentation framework that bridges the script gap with a customized Arabic-to-Maltese transliteration system (accounting for orthographic differences), coupled with several machine translation strategies for aligning and transferring Arabic monolingual data. We evaluate the approach on low-resource Maltese NLP tasks, including named entity recognition and part-of-speech tagging, using both monolingual and multilingual pre-trained models. Results show that augmenting Maltese training data with Arabic data processed through our framework yields substantial performance gains (average +12.3% F1) and markedly outperforms baseline methods. This work constitutes the first empirical validation of effective cross-lingual transfer between genetically related yet orthographically divergent Semitic languages, and it offers a template for NLP in other under-resourced Semitic languages.

📝 Abstract
Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks.
Problem

Research questions and friction points this paper is trying to address.

Augmenting Maltese NLP with Arabic data
Bridging orthographic gap between Maltese and Arabic
Evaluating cross-lingual data augmentation strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transliteration schemes for Arabic-Maltese alignment
Machine translation approaches for data augmentation
Cross-lingual augmentation using Arabic resources
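
As a rough illustration of the transliteration-based augmentation idea listed above, here is a minimal Python sketch. The character map `ARABIC_TO_LATIN` and the helpers `transliterate` and `augment` are hypothetical and deliberately incomplete; they are not the paper's transliteration systems, which are not specified on this page. The sketch assumes a plain character-for-character mapping into Maltese-style Latin letters and ignores short vowels, which Arabic script usually leaves unwritten.

```python
from __future__ import annotations

# Hypothetical, deliberately incomplete character map (an assumption for
# illustration only; NOT the transliteration scheme proposed in the paper).
ARABIC_TO_LATIN = {
    "ا": "a", "ب": "b", "ت": "t", "ج": "ġ", "ح": "ħ", "د": "d",
    "ر": "r", "ز": "ż", "س": "s", "ش": "x", "ع": "għ", "ف": "f",
    "ق": "q", "ك": "k", "ل": "l", "م": "m", "ن": "n", "ه": "h",
    "و": "w", "ي": "j",
}


def transliterate(text: str) -> str:
    """Map each Arabic character to a Latin string; pass unknown characters through."""
    return "".join(ARABIC_TO_LATIN.get(ch, ch) for ch in text)


def augment(maltese: list[str], arabic: list[str]) -> list[str]:
    """Return the Maltese sentences followed by the transliterated Arabic sentences."""
    return maltese + [transliterate(sentence) for sentence in arabic]


if __name__ == "__main__":
    # Arabic "kalb" (dog) loses its unwritten short vowel; compare Maltese "kelb".
    print(augment(["kelb"], ["كلب"]))
```

A naive map like this shows why the choice of scheme matters: the output lacks the vowels that Maltese spelling writes explicitly, which is the kind of orthographic mismatch a Maltese-aware transliteration system would need to handle.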