PARAM-1 BharatGen 2.9B Model

📅 2025-07-16
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
Mainstream large language models (LLMs) structurally under-represent Indian languages because of their English-centric design, which fails to accommodate India's multilingual reality of 20+ official languages, pervasive code-switching, and diglossia. Method: The authors introduce PARAM-1, a natively Indian LLM trained from scratch with a decoder-only architecture (2.9B parameters) on a Hindi-English bilingual corpus. Pretraining follows a language-fair design: a fixed 25% allocation of Indic-language data, a morphology-aware SentencePiece tokenizer adapted to Indian languages, and India-specific evaluation benchmarks spanning IndicQA, code-mixed reasoning, and sociolinguistic robustness. Contribution/Results: The model achieves competitive general-purpose capabilities while substantially outperforming existing state-of-the-art models on Indian-domain question answering, mixed-language reasoning, and sociolinguistic robustness tasks, and it provides a design-first, community-grounded baseline for equitable multilingual modeling.
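The fixed 25% allocation of Indic-language data described above implies a quota-aware sampler in the pretraining data mix. Below is a minimal sketch of how such a language-fair mixture could be enforced; the shard names and the sampling scheme are illustrative assumptions, not details reported by the paper.

```python
import random

# Hypothetical corpus shards; file names are illustrative, not from the paper.
SHARDS = {
    "hindi":   ["hi_web.jsonl", "hi_wiki.jsonl"],   # Indic-language data
    "english": ["en_web.jsonl", "en_books.jsonl"],  # English data
}

# Language-fair mixing weights: a fixed 25% share for Hindi text, mirroring
# the paper's stated corpus allocation. The sampler itself is an assumption
# about one way such a quota could be enforced during pretraining.
MIX = {"hindi": 0.25, "english": 0.75}

def sample_shard(rng: random.Random) -> str:
    """Pick a shard so that, in expectation, 25% of documents are Hindi."""
    lang = rng.choices(list(MIX), weights=list(MIX.values()), k=1)[0]
    return rng.choice(SHARDS[lang])

rng = random.Random(0)
counts = {"hindi": 0, "english": 0}
for _ in range(100_000):
    shard = sample_shard(rng)
    counts["hindi" if shard.startswith("hi_") else "english"] += 1
print(counts)  # roughly 25,000 Hindi vs. 75,000 English draws
```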

Technology Category

Application Category

📝 Abstract
Large Language Models (LLMs) have emerged as powerful general-purpose reasoning systems, yet their development remains dominated by English-centric data, architectures, and optimization paradigms. This exclusionary design results in structural under-representation of linguistically diverse regions such as India, where over 20 official languages and 100+ dialects coexist alongside phenomena like code-switching and diglossia. We introduce PARAM-1, a 2.9B parameter decoder-only, text-only language model trained from scratch with an explicit architectural and linguistic focus on Indian diversity. PARAM-1 is trained on a bilingual dataset consisting of only Hindi and English, constructed with a strong focus on fact-rich, high-quality content. It is guided by three core principles: equitable representation of Indic languages through a 25% corpus allocation; tokenization fairness via a SentencePiece tokenizer adapted to Indian morphological structures; and culturally aligned evaluation benchmarks across IndicQA, code-mixed reasoning, and socio-linguistic robustness tasks. By embedding diversity at the pretraining level, rather than deferring it to post-hoc alignment, PARAM-1 offers a design-first blueprint for equitable foundation modeling. Our results demonstrate that it serves as both a competent general-purpose model and a robust baseline for India-centric applications.
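The abstract's "tokenization fairness" principle centers on a SentencePiece tokenizer adapted to Indian morphology. The sketch below shows how such a tokenizer could be trained on a mixed Hindi-English corpus and sanity-checked with a fertility metric (subword tokens per word); the file names, vocabulary size, and the fertility check are assumptions for illustration, not the paper's actual tokenizer configuration.

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on a mixed Hindi-English corpus.
# "mixed_hi_en.txt", the vocabulary size, and character coverage are
# illustrative assumptions, not the published PARAM-1 settings.
spm.SentencePieceTrainer.train(
    input="mixed_hi_en.txt",      # one sentence per line, Hindi + English
    model_prefix="param1_demo",   # writes param1_demo.model / .vocab
    vocab_size=64000,
    model_type="unigram",
    character_coverage=0.9995,    # retain rare Devanagari characters
)

sp = spm.SentencePieceProcessor(model_file="param1_demo.model")

def fertility(text: str) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    words = text.split()
    pieces = sp.encode(text, out_type=str)
    return len(pieces) / max(len(words), 1)

# A fairness-oriented check: fertility on Hindi text should stay close to
# fertility on English text if the tokenizer treats both languages equitably.
print(fertility("भारत एक बहुभाषी देश है"))
print(fertility("India is a multilingual country"))
```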
Problem

Research questions and friction points this paper is trying to address.

Addressing under-representation of Indian languages in LLMs
Developing a Hindi-English bilingual model for India
Ensuring tokenization fairness and cultural alignment in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual Hindi-English dataset for Indian diversity
SentencePiece tokenizer for Indian morphology
Culturally aligned evaluation benchmarks
🔎 Similar Papers
No similar papers found.
Authors
Kundeshwar Pundalik (BharatGen Team)
Piyush Sawarkar (BharatGen Team)
Nihar Sahoo (BharatGen Team)
Abhishek Shinde (BharatGen Team)
Prateek Chanda (BharatGen Team)
Vedant Goswami (BharatGen Team)
Ajay Nagpal (BharatGen Team)
Atul Singh (Applied Researcher)
Viraj Thakur (BharatGen Team)
Vijay Dewane (BharatGen Team)
Aamod Thakur (BharatGen, Natural Language Processing)
Bhargav Patel (BharatGen Team)
Smita Gautam (BharatGen Team)
Bhagwan Panditi (BharatGen Team)
Shyam Pawar (BharatGen Team)
Madhav Kotcha (BharatGen Team)
Suraj Racha (BharatGen Team)
Saral Sureka (BharatGen Team)
Pankaj Singh (BharatGen Team)
Rishi Bal (BharatGen Team)
Rohit Saluja (BharatGen Team)
Ganesh Ramakrishnan (Professor, Department of Computer Science and Engineering, Indian Institute of Technology Bombay; Machine Learning, Relational Learning, Information Extraction, Question Answering, Text Analytics)