TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe scarcity of high-quality data and dedicated models for code generation in Bangla—the world's fifth most spoken language—this work introduces the first Bangla code instruction datasets along with MBPP-Bangla, an evaluation benchmark for Bangla code generation. It further proposes TigerCoder, the first open-source large language model (LLM) family specifically designed for Bangla code generation, with 1B and 9B parameter variants. Leveraging instruction tuning and domain adaptation, TigerCoder is trained on high-quality Bangla code data. Experimental results show that TigerCoder achieves an ~11–18% improvement in Pass@1 over existing multilingual and general-purpose Bangla LLMs, empirically validating the efficacy of pairing compact models with high-fidelity domain-specific data for low-resource-language code generation. All components—the datasets, models, and evaluation framework—are publicly released to foster community advancement.

📝 Abstract
Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs), particularly for code generation. This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models. Hence, we introduce the first dedicated family of Code LLMs for Bangla (1B & 9B). We offer three major contributions: (1) comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder family of Code LLMs, achieving significant ~11–18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs. Our findings show that curated, high-quality datasets can overcome the limitations of smaller models for low-resource languages. We open-source all resources to further advance Bangla LLM research.
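The reported gains are measured in Pass@1, the fraction of problems solved by a model's first sampled completion. The paper does not publish its evaluation code, but the metric is conventionally computed with the unbiased pass@k estimator; a minimal sketch, assuming n completions are sampled per problem and c of them pass the unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of
    k completions (drawn without replacement from n samples, of
    which c are correct) passes. Pass@1 is the k=1 case."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark score = mean pass@k over all problems, e.g.:
# scores = [pass_at_k(n=10, c=c_i, k=1) for c_i in correct_counts]
```

For k=1 this reduces to c/n, the plain success rate of a single sample; the combinatorial form matters when reporting pass@k for k > 1 from a shared pool of samples.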
Problem

Research questions and friction points this paper is trying to address.

Addresses Bangla's underrepresentation in code generation LLMs
Overcomes scarcity of high-quality Bangla programming data
Introduces specialized models and benchmarks for Bangla coding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Creation of Bangla code instruction datasets for programming domain adaptation
Development of MBPP-Bangla, an evaluation benchmark for Bangla code generation
TigerCoder family of Code LLMs with ~11–18% Pass@1 gains