🤖 AI Summary
For low-resource Pakistani languages—including Urdu, Shahmukhi, Sindhi, and Pashto—named entity recognition (NER) suffers from severe annotation scarcity and inadequate contextual representations in pretrained models. To address this, we propose a culturally adapted cross-lingual data augmentation framework. Methodologically, we systematically validate the effectiveness of fine-tuning multilingual masked language models (e.g., XLM-R) on Shahmukhi and Pashto for the first time, integrating prompt-driven generative data augmentation with few-shot learning. Our contributions are twofold: (1) designing cross-lingual augmentation strategies explicitly aligned with South Asian orthographic conventions and cultural context; and (2) uncovering synergistic gains of large generative models in ultra-low-resource NER settings. Experiments demonstrate that our approach significantly outperforms zero-shot cross-lingual transfer and conventional data augmentation baselines on Shahmukhi and Pashto NER, achieving absolute F1-score improvements of 12.3–18.7 percentage points.
📝 Abstract
Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has shown significant advancements for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences and experiments on four low-resource Pakistani languages; Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.