BanglaByT5: Byte-Level Modelling for Bangla

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional subword tokenizers (e.g., BPE, SentencePiece) suffer from lossy segmentation and model rich morphology and out-of-vocabulary words poorly in morphologically complex, low-resource languages like Bangla. Method: This paper introduces the first byte-level encoder-decoder model specifically designed for Bangla. By adopting byte-level tokenization and bypassing word segmentation entirely, the model eliminates tokenization ambiguity and gains robustness to morphological variation and unseen tokens. Built upon a lightweight ByT5 architecture, the model is pretrained on 14 GB of high-quality Bangla text and supports both zero-shot and supervised generation and classification tasks. Contribution/Results: Experimental results demonstrate that the model surpasses multilingual baselines such as mT5-base across multiple generation and classification benchmarks, despite having significantly fewer parameters. It establishes an efficient, lightweight, and scalable foundation model tailored for resource-constrained languages, advancing practical NLP for Bangla.
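The byte-level tokenization the summary describes can be sketched in a few lines: every UTF-8 byte maps directly to a token id, so any Bangla string (or any script at all) tokenizes without a trained vocabulary and without out-of-vocabulary tokens. The sketch below assumes the ByT5 convention of shifting byte values past a small block of reserved special-token ids; the function names are illustrative, not part of the paper.

```python
SPECIAL_OFFSET = 3  # ByT5 reserves ids 0-2 for special tokens (pad/eos/unk)

def byte_encode(text: str) -> list[int]:
    """Map each UTF-8 byte of `text` to a token id -- no segmentation, no OOV."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")]

def byte_decode(ids: list[int]) -> str:
    """Invert byte_encode, skipping any reserved special-token ids."""
    return bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET).decode("utf-8")

bangla = "বাংলা"  # 5 Bangla characters, each 3 bytes in UTF-8
ids = byte_encode(bangla)
print(len(bangla), len(ids))        # 5 characters become 15 byte tokens
print(byte_decode(ids) == bangla)   # lossless round trip
```

The trade-off is visible in the lengths: sequences grow roughly three-fold for Bangla script compared to character counts, which is why ByT5-style models pair byte inputs with a lightweight architecture.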

📝 Abstract
Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLMs use traditional subword tokenizers such as BPE and SentencePiece, which fail to capture the finer nuances of a morphologically rich language like Bangla (Bengali). In this work, we introduce BanglaByT5, the first byte-level encoder-decoder model explicitly tailored for Bangla. Built upon a small variant of Google's ByT5 architecture, BanglaByT5 is pre-trained on a 14 GB curated corpus combining high-quality literary and newspaper articles. Through zero-shot and supervised evaluations across generative and classification tasks, BanglaByT5 demonstrates competitive performance, surpassing several multilingual and larger models. Our findings demonstrate the efficacy of byte-level modelling for morphologically rich languages and highlight BanglaByT5's potential as a lightweight yet powerful tool for Bangla NLP in both resource-constrained and scalable environments.
Problem

Research questions and friction points this paper is trying to address.

Addressing Bangla language nuances in large language models
Developing byte-level model for morphologically rich Bangla
Improving NLP performance in resource-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Byte-level encoder-decoder model for Bangla
Pre-trained on 14GB curated Bangla corpus
Lightweight yet powerful for Bangla NLP