TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
The scarcity of large language models (LLMs) pretrained on Bangla, together with a lack of dedicated evaluation benchmarks, hinders progress in low-resource language modeling. Method: We introduce TituLLMs, the first open-source Bangla-centric LLM family (1B and 3B parameters), trained on a high-quality, self-constructed corpus of 37 billion tokens. We extend the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge and to tokenize Bangla script more efficiently. Additionally, we curate five novel, human-validated Bangla-specific benchmarks covering commonsense reasoning, reading comprehension, and cultural knowledge. Results: TituLLMs achieve significant performance gains over multilingual baselines across diverse downstream tasks. All model weights and benchmark datasets are publicly released, establishing foundational infrastructure and an efficient adaptation paradigm for Bangla and other under-resourced languages.

📝 Abstract
In this paper, we present TituLLMs, the first large pretrained Bangla LLMs, available in 1B and 3B parameter sizes. Due to computational constraints during both training and inference, we focused on smaller models. To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens. We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge, which also enables faster training and inference. Benchmarking datasets for evaluating Bangla LLMs were lacking; to address this gap, we developed five benchmarking datasets. We benchmarked various LLMs, including TituLLMs, and demonstrated that TituLLMs outperform their base multilingual versions, though not in every setting, highlighting the complexities of language adaptation. Our work lays the groundwork for adapting existing multilingual open models to other low-resource languages. To facilitate broader adoption and further research, we have made the TituLLMs models and benchmarking datasets publicly available (https://huggingface.co/collections/hishab/titulm-llama-family-6718d31fc1b83529276f490a).
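The abstract notes that extending the Llama-3.2 tokenizer with language-specific knowledge enables faster training and inference. The mechanism is sequence-length reduction: a tokenizer whose vocabulary lacks Bangla subwords fragments Bangla text into many small pieces, while one extended with Bangla tokens covers the same text in far fewer tokens. The toy sketch below is not the authors' code; it simulates the effect with a minimal greedy longest-match tokenizer over a hypothetical vocabulary:

```python
# Illustrative sketch (assumed toy vocabulary, not the TituLLMs tokenizer):
# a greedy longest-match tokenizer, before and after adding Bangla tokens.

def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; unknown spans fall back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # character-level fallback, as in byte/char fallback schemes
        tokens.append(match)
        i += len(match)
    return tokens

base_vocab = {" ", "the", "ing"}            # no Bangla coverage: every Bangla char is a fallback
extended_vocab = base_vocab | {"বাংলা", "ভাষা"}  # hypothetical added language-specific tokens

sentence = "বাংলা ভাষা"  # "Bangla language"
before = greedy_tokenize(sentence, base_vocab)      # 10 tokens (one per character)
after = greedy_tokenize(sentence, extended_vocab)   # 3 tokens
print(len(before), len(after))
```

Fewer tokens per document means fewer forward passes per unit of text, which is where the training and inference speedup comes from; in practice the new token ids also require resizing the model's embedding matrix.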
Problem

Research questions and friction points this paper is trying to address.

Develop the first pretrained Bangla LLMs
Create benchmarking datasets for evaluating Bangla LLMs
Adapt multilingual open models to low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed TituLLMs, first Bangla LLMs
Extended the Llama-3.2 tokenizer for faster training and inference
Created five new benchmarking datasets
Shahriar Kabir Nahin
Hishab Singapore Pte. Ltd, Singapore
R. N. Nandi
Hishab Singapore Pte. Ltd, Singapore
Sagor Sarker
Hishab Singapore Pte. Ltd, Singapore
Quazi Sarwar Muhtaseem
Hishab Singapore Pte. Ltd, Singapore
Md Kowsher
Research Assistant, Stevens Institute of Technology
NLP · Deep Learning · Computer Vision
Apu Chandraw Shill
Hishab Singapore Pte. Ltd, Singapore
Md Ibrahim
Hishab Singapore Pte. Ltd, Singapore
Mehadi Hasan Menon
Hishab Singapore Pte. Ltd, Singapore
Tareq Al Muntasir
Chief Technology Officer, Verbex.ai (formerly Hishab)
Automatic Speech Recognition · Text to Speech