🤖 AI Summary
To address data scarcity, adaptation difficulty, and inadequate evaluation in modeling moderately resourced languages like Hindi, this paper introduces Nanda (10B), an open-source instruction-tuned model built upon Llama-3-8B with three key contributions: (1) a Hindi-adapted continuous pre-training paradigm leveraging the Llama Pro block-expansion methodology; (2) a bilingual Hindi–English balancing strategy to strengthen Hindi representation; and (3) a lightweight safety alignment framework integrating RLHF and DPO, coupled with a multi-granularity Hindi evaluation suite (HIQA, HindiMMLU, XNLI-Hi). Experiments demonstrate that Nanda surpasses comparably sized open-source models, including IndicLLM and Airavata, across multiple benchmarks, achieving state-of-the-art performance among open models. It also supports real-world deployment in education and government applications.
📝 Abstract
Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continuous pre-training with expanded transformer blocks, leveraging the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text data; we addressed this through rigorous data curation, augmentation, and strategic bilingual training, balancing Hindi and English corpora to optimize cross-linguistic knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, demonstrating significant advantages over many existing models. We provide an in-depth discussion of training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, demonstrating how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.
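The "expanded transformer blocks" mentioned above refer to Llama Pro-style block expansion: new blocks are interleaved among the frozen original layers and initialized so the expanded model initially behaves identically to the base model. The sketch below illustrates only the layer-layout bookkeeping, not the authors' code; the 32-layer base depth and the count of 8 added blocks are assumptions (Llama-3-8B has 32 layers, and roughly 8 extra blocks is consistent with growing 8B to ~10B parameters).

```python
def expand_blocks(n_orig: int, n_new: int) -> list[tuple[str, int]]:
    """Llama Pro-style layout: interleave n_new identity-initialized
    blocks evenly among n_orig frozen original blocks.

    Each new block is a copy of the preceding original block whose
    output projections are zero-initialized, so it acts as an identity
    mapping at the start of continuous pre-training. Only the new
    blocks are trained; the originals stay frozen.
    """
    assert n_new > 0 and n_orig % n_new == 0, "blocks must divide evenly"
    group = n_orig // n_new
    layout: list[tuple[str, int]] = []
    for i in range(n_orig):
        layout.append(("orig", i))            # frozen base-model block
        if (i + 1) % group == 0:
            layout.append(("new", i))         # trainable copy, zero-init out-proj
    return layout

# Hypothetical Nanda configuration: 32 base blocks + 8 new = 40 total.
layout = expand_blocks(32, 8)
```

With this layout, every group of four frozen blocks is followed by one trainable block, preserving the base model's behavior at initialization while adding capacity for Hindi adaptation.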