🤖 AI Summary
To address the scarcity of translation resources and limited technological support for five major Bengali dialects (Chittagonian, Sylheti, Barishal, Noakhali, and Mymensingh), this paper presents a systematic neural machine translation (NMT) framework for Standard Bengali-to-dialect translation. The authors fine-tune three pretrained models, BanglaT5, mT5, and mBART50, on the Vashantor parallel corpus (32.5K sentence pairs) and find that BanglaT5 outperforms the other two in this low-resource dialect setting. Evaluated with Character Error Rate (CER) and Word Error Rate (WER), BanglaT5 achieves the best results: 12.3% CER and 15.7% WER. Furthermore, we release the first open-source collection of Bengali dialect NMT models, establishing a foundational resource for dialectal NMT research. This work addresses a gap in multilingual NMT, supporting the preservation of linguistic diversity and the deployment of localized language technology.
📝 Abstract
The Bangla language includes many regional dialects, which add to its cultural richness. Translating standard Bangla into these dialects is challenging because vocabulary, pronunciation, and sentence structure vary significantly across regions such as Chittagong, Sylhet, Barishal, Noakhali, and Mymensingh. These dialects, though vital to local identities, lack representation in technological applications. This study addresses this gap by translating standard Bangla into these dialects using neural machine translation (NMT) models, including BanglaT5, mT5, and mBART50. The work is motivated by the need to preserve linguistic diversity and improve communication among dialect speakers. The models were fine-tuned on the "Vashantor" dataset, which contains 32,500 sentences across the dialects, and evaluated using the Character Error Rate (CER) and Word Error Rate (WER) metrics. BanglaT5 performed best, with a CER of 12.3% and a WER of 15.7%, highlighting its effectiveness in capturing dialectal nuances. The outcomes of this research contribute to the development of inclusive language technologies that support regional dialects and promote linguistic diversity.
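To make the two evaluation metrics concrete, the following is a minimal sketch of how CER and WER are commonly computed from Levenshtein edit distance. It is not the authors' evaluation code. The function names are hypothetical, and several choices are assumptions that the abstract does not specify: scores are aggregated at corpus level (total edits divided by total reference length), words are split on whitespace, and characters are Unicode code points rather than grapheme clusters, which matters for Bengali script because vowel signs and conjuncts span several code points.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference tokens into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def _error_rate(refs: List[Sequence], hyps: List[Sequence]) -> float:
    """Corpus-level rate: total edits / total reference length (assumption)."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_len = sum(len(r) for r in refs)
    if total_len == 0:
        raise ValueError("references must not all be empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_len


def cer(refs: List[str], hyps: List[str]) -> float:
    """Character Error Rate over Unicode code points (assumption)."""
    return _error_rate([list(r) for r in refs], [list(h) for h in hyps])


def wer(refs: List[str], hyps: List[str]) -> float:
    """Word Error Rate over whitespace-separated tokens (assumption)."""
    return _error_rate([r.split() for r in refs], [h.split() for h in hyps])
```

In use, `refs` would hold the reference dialect sentences and `hyps` the model outputs for the same standard Bangla inputs; lower values are better for both metrics. Because both rates divide by the reference length, they can exceed 1.0 when a hypothesis is much longer than its reference.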