TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
The scarcity of large language models (LLMs) pretrained on Bangla, together with a lack of dedicated evaluation benchmarks, hinders progress in low-resource language modeling. Method: We introduce TituLLMs, the first open-source Bangla-centric LLM family (1B and 3B parameters), trained on a high-quality, self-constructed corpus of 37 billion tokens. We extend the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge and to tokenize Bangla script more efficiently. Additionally, we curate five novel, human-validated Bangla-specific benchmarks covering commonsense reasoning, reading comprehension, and cultural knowledge. Results: TituLLMs achieve significant performance gains over multilingual baselines across diverse downstream tasks. All model weights and benchmark datasets are publicly released, establishing foundational infrastructure and an efficient adaptation paradigm for Bangla and other under-resourced languages.

📝 Abstract
In this paper, we present TituLLMs, the first large pretrained Bangla LLMs, available in 1B and 3B parameter sizes. Due to computational constraints during both training and inference, we focused on smaller models. To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens. We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge, which also enables faster training and inference. Benchmarking datasets for evaluating Bangla LLMs were lacking; to address this gap, we developed five benchmarking datasets. We benchmarked various LLMs, including TituLLMs, and demonstrated that TituLLMs outperform their base multilingual versions, though not in every setting, highlighting the complexities of language adaptation. Our work lays the groundwork for adapting existing multilingual open models to other low-resource languages. To facilitate broader adoption and further research, we have made the TituLLMs models and benchmarking datasets publicly available (https://huggingface.co/collections/hishab/titulm-llama-family-6718d31fc1b83529276f490a).
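The abstract notes that extending the Llama-3.2 tokenizer with language-specific knowledge enables faster training and inference. The mechanism is sequence-length reduction: a tokenizer whose vocabulary lacks Bangla subwords fragments Bangla text into many small pieces, while one extended with Bangla tokens covers the same text in far fewer tokens. The toy sketch below is not the authors' code; it simulates the effect with a minimal greedy longest-match tokenizer over a hypothetical vocabulary:

```python
# Illustrative sketch (assumed toy vocabulary, not the TituLLMs tokenizer):
# a greedy longest-match tokenizer, before and after adding Bangla tokens.

def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; unknown spans fall back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # character-level fallback, as in byte/char fallback schemes
        tokens.append(match)
        i += len(match)
    return tokens

base_vocab = {" ", "the", "ing"}            # no Bangla coverage: every Bangla char is a fallback
extended_vocab = base_vocab | {"বাংলা", "ভাষা"}  # hypothetical added language-specific tokens

sentence = "বাংলা ভাষা"  # "Bangla language"
before = greedy_tokenize(sentence, base_vocab)      # 10 tokens (one per character)
after = greedy_tokenize(sentence, extended_vocab)   # 3 tokens
print(len(before), len(after))
```

Fewer tokens per document means fewer forward passes per unit of text, which is where the training and inference speedup comes from; in practice the new token ids also require resizing the model's embedding matrix.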
Problem

Research questions and friction points this paper is trying to address.

Develop the first pretrained Bangla LLMs
Create benchmarking datasets for evaluating Bangla LLMs
Adapt multilingual open models to low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed TituLLMs, first Bangla LLMs
Extended the Llama-3.2 tokenizer for faster training and inference
Created five new benchmarking datasets
Shahriar Kabir Nahin
Hishab Singapore Pte. Ltd, Singapore
R. N. Nandi
Hishab Singapore Pte. Ltd, Singapore
Sagor Sarker
Hishab Singapore Pte. Ltd, Singapore
Quazi Sarwar Muhtaseem
Hishab Singapore Pte. Ltd, Singapore
Md Kowsher
Research Assistant, Stevens Institute of Technology
NLP · Deep Learning · Computer Vision
Apu Chandraw Shill
Hishab Singapore Pte. Ltd, Singapore
Md Ibrahim
Hishab Singapore Pte. Ltd, Singapore
Mehadi Hasan Menon
Hishab Singapore Pte. Ltd, Singapore
Tareq Al Muntasir
Chief Technology Officer, Verbex.ai (formerly Hishab)
Automatic Speech Recognition · Text to Speech