BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation

📅 2025-11-05
🤖 AI Summary
Problem: Existing Bangla–English translation systems exhibit low accuracy in STEM-domain terminology, causing semantic distortion in technical queries and hindering Bangla-speaking users’ ability to effectively leverage English large language models for domain-specific problem solving. Method: We construct the first high-quality, STEM-focused (computer science, mathematics, physics, etc.) Bangla–English parallel corpus comprising 5,000 sentence pairs. Our curation employs a human-in-the-loop pipeline integrating large language model–assisted generation with rigorous human verification to ensure terminological precision and semantic fidelity. A dedicated translation model is trained on the T5 architecture using this corpus. Contribution/Results: On code generation and mathematical problem-solving tasks, our model achieves a +12.3 BLEU improvement over strong baselines and reduces critical terminology error rate by 67%. Both the dataset and model are publicly released, establishing a reproducible benchmark and practical resource for low-resource language technical translation.

📝 Abstract
Large language models work well for technical problem solving in English but perform poorly when the same questions are asked in Bangla. A simple solution would be to translate Bangla questions into English first and then use these models. However, existing Bangla-English translation systems struggle with technical terms. They often mistranslate specialized vocabulary, which changes the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM fields including computer science, mathematics, physics, chemistry, and biology. We generated over 12,000 translations using language models and then used human evaluators to select the highest quality pairs that preserve technical terminology correctly. We train a T5-based translation model on BanglaSTEM and test it on two tasks: generating code and solving math problems. Our results show significant improvements in translation accuracy for technical content, making it easier for Bangla speakers to use English-focused language models effectively. Both the BanglaSTEM dataset and the trained translation model are publicly released at https://huggingface.co/reyazul/BanglaSTEM-T5.
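The translate-then-solve idea in the abstract can be sketched as a two-stage pipeline. This is only an illustration under stated assumptions: `answer_bangla_question`, `toy_translate`, and `toy_solve` are hypothetical names that do not come from the paper, and the toy functions stand in for a real Bangla-English translator and an English language model so that the sketch runs without downloading anything.

```python
from typing import Callable


def answer_bangla_question(
    question_bn: str,
    translate: Callable[[str], str],
    solve: Callable[[str], str],
) -> dict:
    """Translate a Bangla question into English, then hand it to an English solver."""
    question_en = translate(question_bn)
    answer = solve(question_en)
    return {"question_bn": question_bn, "question_en": question_en, "answer": answer}


# Hypothetical stand-ins (assumptions, not the released model or any real LLM).
def toy_translate(text: str) -> str:
    lookup = {"দুই যোগ তিন কত?": "What is two plus three?"}
    return lookup.get(text, text)


def toy_solve(question_en: str) -> str:
    return "5" if question_en == "What is two plus three?" else "unknown"


result = answer_bangla_question("দুই যোগ তিন কত?", toy_translate, toy_solve)
print(result["question_en"], "->", result["answer"])
```

Because the solver only ever sees the English text, any technical term that the translation stage gets wrong is passed straight to the solver, which is the failure mode the abstract describes.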
Problem

Research questions and friction points this paper is trying to address.

Existing Bangla-English translation systems struggle with technical terminology
Technical term mistranslation alters problem meaning and causes incorrect answers
BanglaSTEM addresses specialized vocabulary translation in STEM fields
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created parallel corpus for technical Bangla-English translation
Trained T5 model on curated STEM domain dataset
Improved translation accuracy for specialized technical terminology
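The corpus construction described in the abstract (many machine-generated candidates, then human selection of the best pairs) could look roughly like the sketch below. The rating scale, the `terms_preserved` flag, and the rule of keeping the best-rated candidate per source sentence are all assumptions made for illustration; the page does not state how evaluators scored or ranked the candidates, and the source strings are placeholders rather than real Bangla sentences.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Candidate:
    source_bn: str          # placeholder ID standing in for a Bangla sentence
    translation_en: str     # machine-generated English candidate
    ratings: list[int]      # hypothetical 1-5 scores from human evaluators
    terms_preserved: bool   # hypothetical evaluator flag for terminology


def select_pairs(candidates: list[Candidate], limit: int) -> list[Candidate]:
    """Keep the best-rated terminology-preserving candidate per source sentence."""
    best: dict[str, Candidate] = {}
    for cand in candidates:
        if not cand.terms_preserved or not cand.ratings:
            continue  # drop candidates that distort terms or were never rated
        current = best.get(cand.source_bn)
        if current is None or mean(cand.ratings) > mean(current.ratings):
            best[cand.source_bn] = cand
    ranked = sorted(best.values(), key=lambda c: mean(c.ratings), reverse=True)
    return ranked[:limit]


candidates = [
    Candidate("bn-1", "Push the value onto the stack.", [5, 4], True),
    Candidate("bn-1", "Push the value onto the pile.", [3, 2], False),
    Candidate("bn-2", "Find the derivative of the function.", [4, 4], True),
    Candidate("bn-2", "Find the derived of the function.", [2, 3], True),
    Candidate("bn-3", "The cell splits its core.", [3, 3], False),
]
selected = select_pairs(candidates, limit=2)
for pair in selected:
    print(pair.source_bn, "|", pair.translation_en)
```

In this toy run, the candidates flagged as distorting a term are filtered out before ranking, so a source sentence with no acceptable translation simply contributes no pair to the corpus.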
Kazi Reyazul Hasan
Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Mubasshira Musarrat
Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
A. B. M. Alim Al Islam
Professor, Department of CSE, Bangladesh University of Engineering and Technology (BUET)
HCI, Cyber Security, AI & ML, IoT, Intelligent Transportation
Muhammad Abdullah Adnan
Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh
Cloud Computing, Distributed Computing, Distributed Machine Learning, Artificial Intelligence, NLP