Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address data scarcity, adaptation difficulty, and inadequate evaluation in modeling medium-resource languages like Hindi, this paper introduces Nanda (10B), an open-source instruction-tuned model built upon Llama-3-8B with three key innovations: (1) the first Hindi-adapted continuous pretraining paradigm leveraging the Llama Pro block-expansion methodology; (2) a bilingual balancing strategy to strengthen low-resource language representation; and (3) a lightweight safety alignment framework integrating RLHF and DPO, coupled with a multi-granularity Hindi evaluation suite (HIQA, HindiMMLU, XNLI-Hi). Experiments demonstrate that Nanda surpasses comparably sized open-source models, including IndicLLM and Airavata, across multiple benchmarks, achieving state-of-the-art performance among open models. Furthermore, it supports real-world deployment in education and government applications.

📝 Abstract
Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continuous pre-training with expanded transformer blocks, leveraging the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text data; we addressed this through rigorous data curation, augmentation, and strategic bilingual training, balancing Hindi and English corpora to optimize cross-linguistic knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, demonstrating significant advantages over many existing models. We provide an in-depth discussion of training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, demonstrating how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.
Problem

Research questions and friction points this paper is trying to address.

Developing high-quality LLMs for Hindi with limited data
Optimizing cross-linguistic knowledge transfer via bilingual training
Achieving state-of-the-art performance in open-source Hindi LLMs
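The bilingual training mentioned above can be sketched as a ratio-controlled sampler that interleaves the two corpora at the batch level (a minimal illustration; the function name, default ratio, and corpus interfaces are assumptions, not the paper's actual pipeline):

```python
import random

def mixed_batches(hindi_docs, english_docs, hindi_ratio=0.5,
                  n_batches=4, batch_size=2, seed=0):
    """Yield training batches drawn from two corpora at a fixed language ratio.

    `hindi_ratio` is a hypothetical knob: the paper balances the Hindi and
    English corpora empirically rather than fixing any single value.
    """
    rng = random.Random(seed)
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            # Pick the language for each example by a Bernoulli draw.
            pool = hindi_docs if rng.random() < hindi_ratio else english_docs
            batch.append(rng.choice(pool))
        yield batch
```

In practice the Hindi:English mix is tuned to maximize cross-lingual transfer while preserving the base model's English competence; this sketch only shows the sampling mechanics.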
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous pre-training with expanded transformer blocks
Rigorous data curation and bilingual training
Open-sourcing state-of-the-art Hindi-centric LLM
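The Llama Pro-style block expansion named above can be illustrated with a toy residual network (a deliberately simplified sketch; the block structure, the `expand`/`forward` helpers, and the insertion schedule are illustrative assumptions, not the paper's implementation). The key idea: each inserted block has a zero-initialized output projection, so the deeper model initially computes exactly the same function as the original, and only the new blocks are then trained on Hindi data.

```python
import numpy as np

def block(x, w1, w2):
    """A minimal residual MLP block: x + w2 @ relu(w1 @ x)."""
    return x + w2 @ np.maximum(w1 @ x, 0.0)

def expand(blocks, every=2):
    """Depth expansion in the spirit of Llama Pro (toy version):
    after every `every` original blocks, insert a copy whose output
    projection w2 is zeroed, making the new block an identity map
    at initialization."""
    out = []
    for i, (w1, w2) in enumerate(blocks, 1):
        out.append((w1, w2))
        if i % every == 0:
            out.append((w1.copy(), np.zeros_like(w2)))
    return out

def forward(x, blocks):
    for w1, w2 in blocks:
        x = block(x, w1, w2)
    return x
```

Because the zeroed projections make the inserted blocks no-ops, the expanded network reproduces the original outputs exactly before any continued pretraining, which is what lets the method grow Llama-3-8B to 10B parameters without disturbing the base model's knowledge.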
Monojit Choudhury
Professor of Natural Language Processing, MBZUAI
Natural Language Processing · Large Language Models · Ethics of AI · Computational Social Science
Shivam Chauhan
Mohamed Bin Zayed University of Artificial Intelligence, UAE
Rocktim Jyoti Das
University of Maryland, College Park
Machine Learning · Robotics
Dhruv Sahnan
PhD Student in NLP @ MBZUAI
misinformation · disinformation · fact-checking · human-AI collaboration for fact-checking
Xudong Han
Mohamed Bin Zayed University of Artificial Intelligence, UAE
Haonan Li
Mohamed Bin Zayed University of Artificial Intelligence, UAE
Aaryamonvikram Singh
IFM, MBZUAI
NLP · LLMs
Alok Anil Jadhav
Mohamed Bin Zayed University of Artificial Intelligence, UAE
Utkarsh Agarwal
MBZUAI, NLP PhD Student
Machine Learning · Artificial Learning · NLP · Computer Vision · NAS
Mukund Choudhary
Mohamed Bin Zayed University of Artificial Intelligence
Computational Linguistics · Metalinguistics · LLMs · Cognitive Science · Natural Language Processing
Debopriyo Banerjee
Postdoctoral Researcher, Mohamed bin Zayed University of Artificial Intelligence
Natural Language Processing · Recommendation Systems · NeuroSymbolic-AI
Fajri Koto
Assistant Professor (tenure-track), MBZUAI
Computational Linguistics · Natural Language Processing · Multilingual NLP · Human-centered NLP
Junaid Bhat
Inception, UAE
Awantika Shukla
Inception, UAE
Samujjwal Ghosh
Inception
Large Language Models · Natural Language Processing · Domain Adaptation · Graph Neural Networks
Samta Kamboj
Inception, UAE
Onkar Pandit
INRIA Lille, France
NLP · Machine Learning · Deep Learning
Lalit Pradhan
Inception, UAE
Rahul Pal
Inception, UAE
Sunil Sahu
Inception, UAE
Soundar Doraiswamy
Inception, UAE
Parvez Mullah
Inception, UAE
Ali El Filali
Inception, UAE
Neha Sengupta
Inception, UAE
Gokul Ramakrishnan
Cerebras Systems
Rituraj Joshi
Applied ML Scientist, Cerebras Systems
Machine Learning · Deep Learning · Artificial Intelligence · Algorithms
Gurpreet Gosal
Cerebras Systems
language modelling · machine learning · optimization
Avraham Sheinin
Cerebras Systems
Natalia Vassilieva
Sr. Director of Product, Cerebras Systems
image analysis · information retrieval · information extraction · machine learning · natural language processing
Preslav Nakov
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computational Linguistics · Large Language Models · Fact-checking · Fake News