A Comparative Analysis of Retrieval-Augmented Generation Techniques for Bengali Standard-to-Dialect Machine Translation Using LLMs

📅 2025-12-16
🤖 AI Summary
This work addresses low-resource machine translation from Standard Bengali to six regional dialects, a task marked by data scarcity and high linguistic variation. It proposes two fine-tuning-free retrieval-augmented generation (RAG) pipelines. The first retrieves large dialect sentence contexts drawn from audio transcripts; the second retrieves structured dialect–standard sentence pairs. With the sentence-pair pipeline, a smaller model (Llama-3.1-8B) can outperform a much larger one (GPT-OSS-120B), which suggests that retrieval quality can matter more than model size. Both pipelines are evaluated across six dialects and multiple LLMs using BLEU, ChrF, BERTScore, and WER; the sentence-pair pipeline lowers WER from 76% to 55% on the Chittagong dialect. The result is a reusable, fine-tuning-free framework for low-resource dialect translation and preservation.

📝 Abstract
Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local_dialect:standard_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76% to 55% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to outperform much larger models (e.g., GPT-OSS-120B), demonstrating that a well-designed retrieval strategy can be more crucial than model size. This work contributes an effective, fine-tuning-free solution for low-resource dialect translation, offering a practical blueprint for preserving linguistic diversity.
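The WER figures quoted above (76% and 55%) follow the usual definition of Word Error Rate: the word-level edit distance between the system output and the reference, divided by the number of reference words. The sketch below illustrates that definition only. Whitespace tokenization and the function name word_error_rate are assumptions made for this example; the paper's own tokenization and evaluation tooling are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are obtained by whitespace splitting.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Toy example with placeholder tokens: one substitution and one deletion
# against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words, so it is lower-is-better but not bounded by 1.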
Problem

Research questions and friction points this paper is trying to address.

Develops RAG pipelines for Bengali standard-to-dialect translation.
Compares transcript-based and sentence-pair methods across six dialects.
Enables smaller models to outperform larger ones via retrieval strategy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses transcript-based RAG with dialect sentence contexts.
Implements sentence-pair RAG with structured dialect-standard pairs.
Enables small models to outperform larger ones via retrieval.
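A rough sketch of how sentence-pair retrieval might feed a prompt is given here. It is illustrative only and is not the authors' implementation: the character-trigram Jaccard similarity stands in for whatever retriever the paper uses, the function names and prompt wording are hypothetical, and the example pairs are English placeholders rather than real dialect data. Pairs follow the local_dialect:standard_bengali order mentioned in the abstract.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (local_dialect, standard_bengali)


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Character n-grams of whitespace-normalized text."""
    text = " ".join(text.split())
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_pairs(query: str, pairs: List[Pair], k: int = 2) -> List[Pair]:
    """Return the k pairs whose standard side is most similar to the query.

    Assumption: lexical overlap is a stand-in for the real retriever.
    """
    query_grams = char_ngrams(query)
    ranked = sorted(
        pairs,
        key=lambda pair: jaccard(query_grams, char_ngrams(pair[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, retrieved: List[Pair], dialect: str) -> str:
    """Format retrieved pairs as in-context examples (hypothetical wording)."""
    lines = [f"Translate Standard Bengali into the {dialect} dialect.", "Examples:"]
    for dialect_sentence, standard_sentence in retrieved:
        lines.append(f"standard_bengali: {standard_sentence}")
        lines.append(f"local_dialect: {dialect_sentence}")
    lines.append(f"standard_bengali: {query}")
    lines.append("local_dialect:")
    return "\n".join(lines)


# Placeholder data: D1..D3 stand in for dialect sentences.
example_pairs = [
    ("D1", "I eat rice"),
    ("D2", "where are you going"),
    ("D3", "I eat fish"),
]
retrieved = retrieve_pairs("I eat rice today", example_pairs, k=2)
prompt = build_prompt("I eat rice today", retrieved, dialect="Chittagong")
print(prompt)
```

The prompt would then be sent to an LLM of choice with no fine-tuning step. Retrieving short aligned pairs, rather than long transcript passages, gives the model direct standard-to-dialect correspondences to imitate.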
K. M. Jubair Sami
Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Dipto Sumit
Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Ariyan Hossain
Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Farig Sadeque
Associate Professor, BRAC University
Natural Language Processing, Computational Social Science