Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

📅 2025-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit weak multilingual capabilities on low-resource languages (e.g., Hindi) and often suffer native-language performance degradation during multilingual adaptation. Method: We propose Mantra-14B, a lightweight bilingual model obtained by instruction-tuning base models such as Qwen-2.5-14B-Instruct and Phi-4 with culture-aware English–Hindi instruction data (485K high-quality bilingual samples), without vocabulary expansion, added parameters, or architectural modification. Contribution/Results: To our knowledge, this is the first work to demonstrate that balanced bilingual data proportions alone can jointly improve capabilities in both languages. Ablation studies confirm the critical role of culturally localized data. Mantra-14B achieves an average +3% improvement over baselines on English and Hindi benchmarks, outperforming models twice its size. The code, dataset, and model are fully open-sourced under MIT/Apache licenses.

📝 Abstract
Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bilingual LLM Mantra-14B with ~3% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset of 485K English and Hindi instruction samples, we instruction tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance in both English and Hindi. Our experiments, encompassing seven different LLMs of varying parameter sizes and over 140 training runs with varying English-Hindi training data ratios, demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under MIT and Apache licenses to aid further research towards under-represented and low-resource languages.
Problem

Research questions and friction points this paper is trying to address.

Enhancing Hindi-English bilingual LLM performance without compromising native capabilities
Addressing underserved languages using culturally informed data and modest fine-tuning
Improving multilingual models without resource-intensive techniques or architectural changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction tuning with curated bilingual dataset
Improves performance without architectural changes
Culturally informed data reduces computational overhead
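The ratio-balanced bilingual mixing behind these experiments can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function name, ratio parameter, and sampling scheme are all hypothetical, standing in for "varying English-Hindi training data ratios" described above.

```python
import random

def mix_bilingual(english_samples, hindi_samples, english_ratio, seed=0):
    """Build an instruction-tuning mix with a fixed English:Hindi ratio.

    english_ratio is the fraction of the final mix drawn from English
    data; the remainder comes from Hindi data. Illustrative sketch only.
    """
    rng = random.Random(seed)
    # Largest total size achievable without reusing samples from either pool.
    limit_en = len(english_samples) / english_ratio if english_ratio > 0 else float("inf")
    limit_hi = len(hindi_samples) / (1 - english_ratio) if english_ratio < 1 else float("inf")
    total = int(min(limit_en, limit_hi))
    n_en = int(total * english_ratio)
    n_hi = total - n_en
    # Sample without replacement from each pool, then shuffle the combined mix.
    mix = rng.sample(english_samples, n_en) + rng.sample(hindi_samples, n_hi)
    rng.shuffle(mix)
    return mix

# e.g. a 50/50 mix over toy samples
en = [f"en-{i}" for i in range(100)]
hi = [f"hi-{i}" for i in range(100)]
mix = mix_bilingual(en, hi, english_ratio=0.5)
```

Sweeping `english_ratio` over a grid of values is one simple way to search for the balance point at which Hindi gains stop costing English performance.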
Authors

Ram Mohan Rao Kadiyala — Traversaal.ai
Siddartha Pullakhandam — Vantager
Siddhant Gupta — IIT Roorkee
Drishti Sharma — Cohere for AI Community
Jebish Purbey — M2ai.in, Pulchowk Campus
Kanwal Mehreen — Cohere for AI Community
Muhammad Arham — National University of Sciences and Technology
Hamza Farooq — Researcher, University of Minnesota, USA