CultranAI at PalmX 2025: Data Augmentation for Cultural Knowledge Representation

📅 2025-08-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the limited representation of Arabic cultural knowledge in large language models (LLMs). To mitigate this, we propose a multi-source cultural data fusion strategy for dataset augmentation, constructing a high-quality, culturally grounded multiple-choice dataset comprising over 22,000 items. Starting from the Fanar-1-9B-Instruct base model, we apply parameter-efficient fine-tuning via LoRA. Crucially, we combine the PalmX and Palm datasets to enhance cross-domain generalization. Experimental results demonstrate that our approach achieves 84.1% accuracy on the PalmX development set and ranks fifth on the blind test set with 70.50% accuracy, marking a substantial improvement in Arabic cultural understanding and reasoning. The work validates the efficacy and scalability of combining culturally enriched data augmentation with lightweight fine-tuning to improve cultural knowledge representation in LLMs.

📝 Abstract
In this paper, we report our participation in the PalmX cultural evaluation shared task. Our system, CultranAI, focused on data augmentation and LoRA fine-tuning of large language models (LLMs) for Arabic cultural knowledge representation. We benchmarked several LLMs to identify the best-performing model for the task. In addition to utilizing the PalmX dataset, we augmented it by incorporating the Palm dataset and curated a new dataset of over 22K culturally grounded multiple-choice questions (MCQs). Our experiments showed that the Fanar-1-9B-Instruct model achieved the highest performance. We fine-tuned this model on the combined augmented dataset of 22K+ MCQs. On the blind test set, our submitted system ranked 5th with an accuracy of 70.50%, while on the PalmX development set, it achieved an accuracy of 84.1%.
Problem

Research questions and friction points this paper is trying to address.

Augmenting data for Arabic cultural knowledge representation
Fine-tuning LLMs with LoRA for cultural evaluation tasks
Benchmarking models to identify optimal performance for MCQs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data augmentation with Palm and curated datasets
LoRA fine-tuning of large language models
Fanar-1-9B-Instruct model optimization for Arabic
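LoRA's appeal for this kind of task is parameter efficiency: instead of updating a full weight matrix W, it trains a low-rank update ΔW = (α/r)·B·A while W stays frozen. A minimal NumPy sketch of the idea (the dimensions, rank, and scaling factor below are illustrative assumptions, not the paper's actual Fanar-1-9B-Instruct configuration):

```python
import numpy as np

# Illustrative sizes; real transformer layer shapes and LoRA settings differ.
d, r, alpha = 4096, 16, 32           # hidden size, LoRA rank, scaling factor

W = np.random.randn(d, d)            # frozen pretrained weight (not trained)
A = np.random.randn(r, d) * 0.01     # trainable low-rank factor (r x d)
B = np.zeros((d, r))                 # trainable low-rank factor, zero-initialized
                                     # so fine-tuning starts from the base model

# Effective weight during fine-tuning: W + (alpha / r) * B @ A
W_eff = W + (alpha / r) * (B @ A)

full_params = d * d                  # parameters a full update would train
lora_params = r * d + d * r          # parameters LoRA actually trains
print(f"trainable fraction: {lora_params / full_params:.4%}")
```

With these toy numbers, LoRA trains well under 1% of the layer's parameters, which is why a 9B-parameter model can be adapted on the 22K+ MCQ dataset with modest compute.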
Hunzalah Hassan Bhatti
Qatar University
Youssef Ahmed
Qatar University
Md Arid Hasan
PhD Student, University of Toronto
LLMs · Multimodality · Bias in LLMs · Responsible AI
Firoj Alam
Qatar Computing Research Institute