🤖 AI Summary
Bengali, the world's fifth most spoken language, lacks high-performance, reproducible open-source large language models (LLMs), reflecting a critical gap in low-resource language LLM development. Method: We introduce TigerLLM, the first high-performance, fully reproducible open-source Bengali LLM family, built upon the standard Transformer architecture. Our approach comprises construction of a curated, high-quality Bengali corpus, multi-stage continued pretraining, and instruction fine-tuning. Contribution/Results: TigerLLM is the first open-source Bengali LLM to outperform GPT-3.5 across multiple benchmarks, including BanglaBench and BBQ-LM, and it significantly surpasses all existing open-source Bengali LLMs. We also provide end-to-end reproducibility, a community-oriented design (e.g., modular training scripts, comprehensive documentation), and lightweight deployment support (e.g., quantized variants). By establishing a scalable, transparent, and accessible framework, TigerLLM bridges a key gap for low-resource language LLMs and offers a generalizable paradigm for non-English LLM development.
📝 Abstract
The development of Large Language Models (LLMs) remains heavily skewed towards English and a few other high-resource languages. This linguistic disparity is particularly evident for Bangla, the 5th most spoken language. A few initiatives have attempted to create open-source Bangla LLMs, but their performance still lags behind that of models for high-resource languages, and their reproducibility is limited. To address this gap, we introduce TigerLLM, a family of Bangla LLMs. Our results demonstrate that these models surpass all open-source alternatives and also outperform larger proprietary models like GPT-3.5 across standard benchmarks, establishing TigerLLM as the new baseline for future Bangla language modeling.