🤖 AI Summary
To address the suboptimal performance of multilingual large language models (LLMs) on low-resource languages—specifically Urdu—this paper introduces UrduLLaMA 1.0, presented as the first LLM designed exclusively for Urdu. Built on the Llama-3.1-8B-Instruct architecture, it combines continual pretraining with parameter-efficient fine-tuning (PEFT) under tight data constraints: 128M Urdu tokens for continual pretraining, followed by LoRA-based fine-tuning on 41K Urdu instruction-response pairs and 50K English–Urdu parallel sentence pairs. This hybrid adaptation strategy substantially improves Urdu language understanding, instruction following, and English–Urdu translation. On three major machine translation benchmarks, UrduLLaMA 1.0 achieves BLEU scores surpassing prior state-of-the-art (SOTA) methods by +4.2–6.8 points. The work establishes a practical recipe and a new benchmark for high-quality LLM adaptation in low-resource settings.
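Why LoRA suits this low-resource setting can be seen from a quick parameter count. The sketch below is illustrative only: the 4096×4096 projection size is an assumption typical of 8B-class Llama models, and rank r=16 is a common LoRA default, not a value reported in the paper.

```python
# Hedged sketch: LoRA's parameter-efficiency argument.
# Dimensions and rank are assumptions, not figures from the paper.

def lora_trainable_params(d: int, k: int, r: int) -> int:
    """LoRA freezes the base weight W (d x k) and trains two
    low-rank factors, B (d x r) and A (r x k), so the effective
    update is W + B @ A with far fewer trainable parameters."""
    return d * r + r * k

def full_finetune_params(d: int, k: int) -> int:
    """Full fine-tuning updates every entry of W."""
    return d * k

d = k = 4096   # assumed projection size for an 8B-class model
r = 16         # assumed LoRA rank (common default)

full = full_finetune_params(d, k)
lora = lora_trainable_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
# → full: 16,777,216  lora: 131,072  ratio: 128x
```

At rank 16 each adapted matrix trains roughly 0.8% of the parameters full fine-tuning would, which is what makes adapting an 8B model feasible on the small Urdu instruction and translation sets described above.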
📝 Abstract
Multilingual Large Language Models (LLMs) often provide suboptimal performance on low-resource languages like Urdu. This paper introduces UrduLLaMA 1.0, a model derived from the open-source Llama-3.1-8B-Instruct architecture and continually pre-trained on 128 million Urdu tokens, capturing the rich diversity of the language. To enhance instruction-following and translation capabilities, we leverage Low-Rank Adaptation (LoRA) to fine-tune the model on 41,000 Urdu instructions and approximately 50,000 English–Urdu translation pairs. Evaluation across three machine translation datasets demonstrates significant performance improvements compared to state-of-the-art (SOTA) models, establishing a new benchmark for Urdu LLMs. These findings underscore the potential of targeted adaptation strategies with limited data and computational resources to address the unique challenges of low-resource languages.