🤖 AI Summary
To address data scarcity, adaptation difficulty, and inadequate evaluation in modeling moderately resourced languages like Hindi, this paper introduces Nanda (10B), an open-source instruction-tuned model built upon Llama-3-8B with three key contributions: (1) a Hindi-adapted continuous pre-training paradigm leveraging the Llama Pro block-expansion methodology; (2) a bilingual Hindi–English balancing strategy to strengthen Hindi representation; and (3) a lightweight safety alignment framework integrating RLHF and DPO, coupled with a multi-granularity Hindi evaluation suite (HIQA, HindiMMLU, XNLI-Hi). Experiments demonstrate that Nanda surpasses comparably sized open-source models, including IndicLLM and Airavata, across multiple benchmarks, achieving state-of-the-art performance among open models. It also supports real-world deployment in education and government applications.
📝 Abstract
Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continuous pre-training with expanded transformer blocks, leveraging the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text data; we addressed this through rigorous data curation, augmentation, and strategic bilingual training, balancing Hindi and English corpora to optimize cross-linguistic knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, demonstrating significant advantages over many existing models. We provide an in-depth discussion of training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, demonstrating how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.
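The "expanded transformer blocks" mentioned above refer to Llama Pro-style block expansion: new blocks are interleaved among the frozen original layers and initialized so the expanded model initially behaves identically to the base model. The sketch below illustrates only the layer-layout bookkeeping, not the authors' code; the 32-layer base depth and the count of 8 added blocks are assumptions (Llama-3-8B has 32 layers, and roughly 8 extra blocks is consistent with growing 8B to ~10B parameters).

```python
def expand_blocks(n_orig: int, n_new: int) -> list[tuple[str, int]]:
    """Llama Pro-style layout: interleave n_new identity-initialized
    blocks evenly among n_orig frozen original blocks.

    Each new block is a copy of the preceding original block whose
    output projections are zero-initialized, so it acts as an identity
    mapping at the start of continuous pre-training. Only the new
    blocks are trained; the originals stay frozen.
    """
    assert n_new > 0 and n_orig % n_new == 0, "blocks must divide evenly"
    group = n_orig // n_new
    layout: list[tuple[str, int]] = []
    for i in range(n_orig):
        layout.append(("orig", i))            # frozen base-model block
        if (i + 1) % group == 0:
            layout.append(("new", i))         # trainable copy, zero-init out-proj
    return layout

# Hypothetical Nanda configuration: 32 base blocks + 8 new = 40 total.
layout = expand_blocks(32, 8)
```

With this layout, every group of four frozen blocks is followed by one trainable block, preserving the base model's behavior at initialization while adding capacity for Hindi adaptation.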