PARAM-1 BharatGen 2.9B Model

📅 2025-07-16
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
Mainstream large language models (LLMs) structurally under-represent Indian languages because of their English-centric design, which fails to accommodate India's multilingual reality of 20+ official languages, pervasive code-switching, and diglossia. Method: The authors introduce PARAM-1, a natively Indian LLM trained from scratch with a decoder-only architecture (2.9B parameters) on a Hindi-English bilingual corpus. Pretraining follows a language-fair design: a fixed 25% allocation of Indic-language data, a morphology-aware SentencePiece tokenizer adapted to Indian languages, and India-specific evaluation benchmarks spanning IndicQA, code-mixed reasoning, and sociolinguistic robustness. Contribution/Results: The model achieves competitive general-purpose capabilities while substantially outperforming existing state-of-the-art models on Indian-domain question answering, mixed-language reasoning, and sociolinguistic robustness tasks, and it provides a design-first, community-grounded baseline for equitable multilingual modeling.
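The fixed 25% allocation of Indic-language data described above implies a quota-aware sampler in the pretraining data mix. Below is a minimal sketch of how such a language-fair mixture could be enforced; the shard names and the sampling scheme are illustrative assumptions, not details reported by the paper.

```python
import random

# Hypothetical corpus shards; file names are illustrative, not from the paper.
SHARDS = {
    "hindi":   ["hi_web.jsonl", "hi_wiki.jsonl"],   # Indic-language data
    "english": ["en_web.jsonl", "en_books.jsonl"],  # English data
}

# Language-fair mixing weights: a fixed 25% share for Hindi text, mirroring
# the paper's stated corpus allocation. The sampler itself is an assumption
# about one way such a quota could be enforced during pretraining.
MIX = {"hindi": 0.25, "english": 0.75}

def sample_shard(rng: random.Random) -> str:
    """Pick a shard so that, in expectation, 25% of documents are Hindi."""
    lang = rng.choices(list(MIX), weights=list(MIX.values()), k=1)[0]
    return rng.choice(SHARDS[lang])

rng = random.Random(0)
counts = {"hindi": 0, "english": 0}
for _ in range(100_000):
    shard = sample_shard(rng)
    counts["hindi" if shard.startswith("hi_") else "english"] += 1
print(counts)  # roughly 25,000 Hindi vs. 75,000 English draws
```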

Technology Category

Application Category

📝 Abstract
Large Language Models (LLMs) have emerged as powerful general-purpose reasoning systems, yet their development remains dominated by English-centric data, architectures, and optimization paradigms. This exclusionary design results in structural under-representation of linguistically diverse regions such as India, where over 20 official languages and 100+ dialects coexist alongside phenomena like code-switching and diglossia. We introduce PARAM-1, a 2.9B parameter decoder-only, text-only language model trained from scratch with an explicit architectural and linguistic focus on Indian diversity. PARAM-1 is trained on a bilingual dataset consisting of only Hindi and English, constructed with a strong focus on fact-rich, high-quality content. It is guided by three core principles: equitable representation of Indic languages through a 25% corpus allocation; tokenization fairness via a SentencePiece tokenizer adapted to Indian morphological structures; and culturally aligned evaluation benchmarks across IndicQA, code-mixed reasoning, and socio-linguistic robustness tasks. By embedding diversity at the pretraining level, rather than deferring it to post-hoc alignment, PARAM-1 offers a design-first blueprint for equitable foundation modeling. Our results demonstrate that it serves as both a competent general-purpose model and a robust baseline for India-centric applications.
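The abstract's "tokenization fairness" principle centers on a SentencePiece tokenizer adapted to Indian morphology. The sketch below shows how such a tokenizer could be trained on a mixed Hindi-English corpus and sanity-checked with a fertility metric (subword tokens per word); the file names, vocabulary size, and the fertility check are assumptions for illustration, not the paper's actual tokenizer configuration.

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on a mixed Hindi-English corpus.
# "mixed_hi_en.txt", the vocabulary size, and character coverage are
# illustrative assumptions, not the published PARAM-1 settings.
spm.SentencePieceTrainer.train(
    input="mixed_hi_en.txt",      # one sentence per line, Hindi + English
    model_prefix="param1_demo",   # writes param1_demo.model / .vocab
    vocab_size=64000,
    model_type="unigram",
    character_coverage=0.9995,    # retain rare Devanagari characters
)

sp = spm.SentencePieceProcessor(model_file="param1_demo.model")

def fertility(text: str) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    words = text.split()
    pieces = sp.encode(text, out_type=str)
    return len(pieces) / max(len(words), 1)

# A fairness-oriented check: fertility on Hindi text should stay close to
# fertility on English text if the tokenizer treats both languages equitably.
print(fertility("भारत एक बहुभाषी देश है"))
print(fertility("India is a multilingual country"))
```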
Problem

Research questions and friction points this paper is trying to address.

Addressing under-representation of Indian languages in LLMs
Developing a Hindi-English bilingual model for India
Ensuring tokenization fairness and cultural alignment in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual Hindi-English dataset for Indian diversity
SentencePiece tokenizer for Indian morphology
Culturally aligned evaluation benchmarks
🔎 Similar Papers
No similar papers found.
Authors
Kundeshwar Pundalik (BharatGen Team)
Piyush Sawarkar (BharatGen Team)
Nihar Sahoo (BharatGen Team)
Abhishek Shinde (BharatGen Team)
Prateek Chanda (BharatGen Team)
Vedant Goswami (BharatGen Team)
Ajay Nagpal (BharatGen Team)
Atul Singh (Applied Researcher)
Viraj Thakur (BharatGen Team)
Vijay Dewane (BharatGen Team)
Aamod Thakur (BharatGen, Natural Language Processing)
Bhargav Patel (BharatGen Team)
Smita Gautam (BharatGen Team)
Bhagwan Panditi (BharatGen Team)
Shyam Pawar (BharatGen Team)
Madhav Kotcha (BharatGen Team)
Suraj Racha (BharatGen Team)
Saral Sureka (BharatGen Team)
Pankaj Singh (BharatGen Team)
Rishi Bal (BharatGen Team)
Rohit Saluja (BharatGen Team)
Ganesh Ramakrishnan (Professor, Department of Computer Science and Engineering, Indian Institute of Technology Bombay; Machine Learning, Relational Learning, Information Extraction, Question Answering, Text Analytics)