TigerLLM -- A Family of Bangla Large Language Models

📅 2025-03-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Bengali, the world's fifth most spoken language, lacks high-performance, reproducible open-source large language models (LLMs), a critical gap in LLM development for low-resource languages. Method: We introduce TigerLLM, the first high-performance, fully reproducible open-source Bengali LLM family, built on the standard Transformer architecture. Our approach comprises construction of a curated, high-quality Bengali corpus, multi-stage continued pretraining, and instruction fine-tuning. Contribution/Results: TigerLLM is the first open-source Bengali LLM to outperform GPT-3.5 across multiple benchmarks, including BanglaBench and BBQ-LM, and it significantly surpasses all existing open-source Bengali LLMs. We also ensure end-to-end reproducibility, a community-oriented design (e.g., modular training scripts, comprehensive documentation), and lightweight deployment support (e.g., quantized variants). By establishing a scalable, transparent, and accessible framework, TigerLLM closes a key gap for low-resource language LLMs and offers a generalizable paradigm for non-English LLM development.

📝 Abstract

The development of Large Language Models (LLMs) remains heavily skewed towards English and a few other high-resource languages. This linguistic disparity is particularly evident for Bangla, the 5th most spoken language. A few initiatives have attempted to create open-source Bangla LLMs, but their performance still lags behind that of models for high-resource languages and their reproducibility is limited. To address this gap, we introduce TigerLLM, a family of Bangla LLMs. Our results demonstrate that these models surpass all open-source alternatives and also outperform larger proprietary models like GPT-3.5 across standard benchmarks, establishing TigerLLM as the new baseline for future Bangla language modeling.

Problem

Research questions and friction points this paper is trying to address.

LLM development is heavily skewed towards English and other high-resource languages, leaving Bangla underserved.

Existing open-source Bangla LLMs lag in performance and offer limited reproducibility.

Bangla language modeling needs a stronger open baseline for future work.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed TigerLLM, a family of Bangla LLMs.

Outperformed all open-source alternatives and larger proprietary models such as GPT-3.5 on standard benchmarks.

Established a new baseline for future Bangla language modeling.
