Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of high-quality data, multilingual inconsistency, and insufficient safety alignment in large language model (LLM) training for low-resource languages—specifically Urdu—this paper introduces Urdu-Instruct: the first high-quality, multilingual synthetic instruction dataset tailored for Urdu, integrating bilingual translation, cultural adaptation, and ethical alignment. Methodologically, we enhance the self-instruction paradigm by incorporating task-specific prompts and a global task pool, while embedding Urdu-native chain-of-thought reasoning. Leveraging Llama-3.1-8B, we perform joint fine-tuning via multilingual synthetic data distillation, bilingual alignment, and safety alignment. The resulting open-weight model, Alif-1.0-8B-Instruct, significantly outperforms Llama-3.1-8B-Instruct and leading multilingual LLMs on Urdu-specific benchmarks. Notably, training costs remain under USD 100. All code, data, and model weights are publicly released to foster reproducibility and community advancement.

📝 Abstract
Developing high-performing large language models (LLMs) for low-resource languages such as Urdu presents several challenges, including the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data; however, such translations frequently lack quality and cultural nuance while also incurring significant data curation and training costs. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought reasoning, bilingual translation, cultural relevance, and ethical safety alignment. This technique significantly enhances the Alif-1.0-8B-Instruct model's comprehension of Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, outperforms Llama-3.1-8B-Instruct on Urdu-specific tasks. It also outperforms leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under $100. Our results demonstrate that high-performing LLMs for low-resource languages can be developed efficiently and with cultural alignment using our modified self-instruct approach. All datasets, models, and code are publicly available at: https://github.com/traversaal-ai/alif-urdu-llm.
Problem

Research questions and friction points this paper is trying to address.

Developing Urdu LLMs with limited datasets and cultural relevance
Overcoming poor translation quality and high training costs
Creating efficient bilingual models using synthetic data distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modified self-instruct technique generates multilingual synthetic data
Global task pool with unique prompts enhances Urdu-native reasoning
Culturally aligned dataset improves Urdu task performance efficiently
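The modified self-instruct idea above can be sketched as a generation loop in which each task type keeps its own prompt template and seed examples, while newly generated instructions feed a shared global task pool that conditions later rounds. This is a minimal illustrative sketch, not the paper's implementation: `generate_with_llm` is a hypothetical stand-in for a real LLM call, and the task names, templates, and seeds are invented for illustration.

```python
import random

def generate_with_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM API call.

    Here it just returns a mock instruction derived from the prompt.
    """
    return f"[synthetic instruction derived from: {prompt[:40]}...]"

# Per-task prompt templates and seed examples (illustrative values only).
TASK_CONFIGS = {
    "translation": {
        "prompt": "Write a new Urdu-English translation task unlike: {examples}",
        "seeds": ["Translate this Urdu proverb into English."],
    },
    "reasoning": {
        "prompt": "Write a new Urdu chain-of-thought reasoning task unlike: {examples}",
        "seeds": ["Solve this arithmetic word problem, reasoning step by step in Urdu."],
    },
}

def build_task_pool(rounds: int = 2, per_round: int = 3, seed: int = 0) -> list[str]:
    """Grow a global task pool via per-task prompts seeded from shared examples."""
    rng = random.Random(seed)
    # Seed the global pool with every task's seed instructions.
    global_pool = [s for cfg in TASK_CONFIGS.values() for s in cfg["seeds"]]
    for _ in range(rounds):
        for cfg in TASK_CONFIGS.values():
            # Condition each task's generation on samples drawn from the
            # global pool, so task types stay diverse but mutually aware.
            examples = "; ".join(rng.sample(global_pool, k=min(2, len(global_pool))))
            prompt = cfg["prompt"].format(examples=examples)
            for _ in range(per_round):
                candidate = generate_with_llm(prompt)
                if candidate not in global_pool:  # simple dedup filter
                    global_pool.append(candidate)
    return global_pool

pool = build_task_pool()
```

In a real pipeline the dedup filter would typically be a similarity check (e.g. ROUGE overlap, as in the original self-instruct work) rather than exact string matching, and each generated instruction would be paired with a model-written response before fine-tuning.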