Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi

📅 2026-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the persistent underrepresentation of low-resource languages such as Hindi in large multilingual models, which exacerbates linguistic inequity in NLP. We present LilMoo, a 0.6B-parameter Hindi-specific language model trained from scratch through a transparent, reproducible, and resource-efficient pipeline. We construct the high-quality GigaLekh corpus by combining heuristic filtering rules with an LLM-as-a-judge curation strategy, and augment training with carefully selected English–Hindi bilingual data. Experimental results demonstrate that LilMoo consistently outperforms comparable-scale multilingual baselines, including Qwen2.5-0.5B and Qwen3-0.6B, across multiple tasks, establishing that small-scale monolingual pretraining can effectively rival larger multilingual models at the sub-billion-parameter scale.
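
The curation step pairs cheap rule-based filters with an LLM judge, but the exact rules, prompt, and judge model are not specified on this page. The Python sketch below is only a minimal illustration of that two-stage pattern; the thresholds, the JUDGE_PROMPT wording, and the judge_fn callable are assumed placeholders rather than the authors' implementation.

# Minimal sketch of a two-stage corpus-filtering pipeline in the spirit of
# GigaLekh's curation: cheap heuristics first, then LLM-as-a-judge on survivors.
# All thresholds, the judge prompt, and the judge_fn interface are illustrative assumptions.
import re
from typing import Callable, Iterable, Iterator

DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def passes_heuristics(doc: str,
                      min_chars: int = 200,
                      min_devanagari_ratio: float = 0.5,
                      max_symbol_ratio: float = 0.1) -> bool:
    """Cheap rule-based filters: minimum length, Hindi-script density, symbol noise."""
    if len(doc) < min_chars:
        return False
    chars = [c for c in doc if not c.isspace()]
    if not chars:
        return False
    devanagari_ratio = sum(bool(DEVANAGARI.match(c)) for c in chars) / len(chars)
    symbol_ratio = sum(not c.isalnum() and not DEVANAGARI.match(c) for c in chars) / len(chars)
    return devanagari_ratio >= min_devanagari_ratio and symbol_ratio <= max_symbol_ratio

JUDGE_PROMPT = (
    "Rate the following Hindi document from 1 (incoherent or spam) to 5 "
    "(clean, informative prose). Answer with a single digit.\n\n{doc}"
)

def curate(docs: Iterable[str],
           judge_fn: Callable[[str], str],
           min_score: int = 4) -> Iterator[str]:
    """Keep documents that pass the heuristics and are rated >= min_score by an
    external LLM judge; judge_fn wraps whatever model or API actually scores text."""
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        reply = judge_fn(JUDGE_PROMPT.format(doc=doc[:4000]))  # truncate input to the judge
        digits = [int(ch) for ch in reply if ch.isdigit()]
        if digits and digits[0] >= min_score:
            yield doc

Running the heuristic pass first matters because it is orders of magnitude cheaper than the judge calls, so only a small fraction of documents ever reach the LLM.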

📝 Abstract
The dominance of large multilingual foundation models has widened linguistic inequalities in Natural Language Processing (NLP), often leaving low-resource languages underrepresented. This paper introduces LilMoo, a 0.6-billion-parameter Hindi language model trained entirely from scratch to address this gap. Unlike prior Hindi models that rely on continual pretraining from opaque multilingual foundations, LilMoo is developed through a fully transparent and reproducible pipeline optimized for limited compute environments. We construct a high-quality Hindi corpus (GigaLekh) filtered through both heuristic and learned (LLM-as-a-judge) methods, complemented by bilingual augmentation with curated English data. Using this dataset, we explore various training recipes for small-scale language models. Across comprehensive evaluation suites, LilMoo consistently outperforms comparably sized multilingual baselines such as Qwen2.5-0.5B and Qwen3-0.6B, demonstrating that well-designed language-specific pretraining can rival large multilingual models at the sub-billion-parameter range.
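
The abstract also mentions augmenting Hindi pretraining with curated English bilingual data, without stating the mixture ratio here. The sketch below only illustrates how a fixed-proportion bilingual sampler could be written; the 90/10 split, batch size, and corpus variables are assumptions for illustration, not the paper's recipe.

import random
from typing import Iterator, Sequence

def mixed_batches(hindi_docs: Sequence[str],
                  english_docs: Sequence[str],
                  hindi_fraction: float = 0.9,  # assumed ratio, not reported here
                  batch_size: int = 32,
                  seed: int = 0) -> Iterator[list[str]]:
    """Yield pretraining batches that draw a fixed fraction of documents from the
    Hindi corpus and fill the remainder with curated English data."""
    rng = random.Random(seed)
    n_hindi = round(batch_size * hindi_fraction)
    while True:
        batch = rng.sample(hindi_docs, n_hindi) + rng.sample(english_docs, batch_size - n_hindi)
        rng.shuffle(batch)  # interleave the two languages within the batch
        yield batch

Keeping a fixed per-batch ratio rather than simply concatenating the corpora makes the language mix independent of the relative corpus sizes, which is one common way such bilingual augmentation is implemented.
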
Problem

Research questions and friction points this paper is trying to address.

linguistic inequality
low-resource languages
Hindi NLP
multilingual foundation models
language representation gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

language-specific pretraining
low-resource languages
transparent training pipeline
bilingual data augmentation
compact language model
🔎 Similar Papers
No similar papers found.
Authors
Shiza Fatimah
Bonn-Aachen International Center for Information Technology (b-it) / CAISA Lab; Lamarr Institute for Machine Learning and Artificial Intelligence
Aniket Sen
Helmholtz-Institut für Strahlen- und Kernphysik
Sophia Falk
Bonn Sustainable AI Lab
Florian Mai
Junior Research Group Leader, University of Bonn
AI alignment, LLM reasoning, LLMs
Lucie Flek
University of Bonn, Lamarr Institute of Machine Learning and Artificial Intelligence
Natural Language Processing, Machine Learning, Physics, Computational Social Sciences
Nicholas Kluge Corrêa
Bonn-Aachen International Center for Information Technology (b-it) / CAISA Lab; Lamarr Institute for Machine Learning and Artificial Intelligence