LLMs for Low-Resource Dialect Translation Using Context-Aware Prompting: A Case Study on Sylheti

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Low-resource dialects, such as Sylheti, a variant of Bengali, suffer from pervasive lexical gaps, semantic inaccuracy, and hallucination in large language model (LLM) translation. To address these challenges, this paper proposes Sylheti-CAP, a context-aware prompting framework that integrates three components: (1) linguistic rule injection, (2) bilingual dictionary-guided decoding, and (3) human-verified factual grounding. It presents the first systematic evaluation of leading LLMs, including GPT-4.1, LLaMA-4, Grok-3, and DeepSeek-V3.2, on Sylheti translation. Experimental results demonstrate that Sylheti-CAP significantly outperforms baseline methods on BLEU (+8.4), chrF (+12.6), and human evaluations (fluency, faithfulness, dialectal appropriateness), reducing erroneous expressions by 37.2%. Its modular architecture generalizes readily to other low-resource dialects, offering a new paradigm for dialect preservation and LLM localization.

📝 Abstract
Large Language Models (LLMs) have demonstrated strong translation abilities through prompting, even without task-specific training. However, their effectiveness in dialectal and low-resource contexts remains underexplored. This study presents the first systematic investigation of LLM-based machine translation (MT) for Sylheti, a dialect of Bangla that is itself low-resource. We evaluate five advanced LLMs (GPT-4.1, LLaMA 4, Grok 3, and DeepSeek V3.2) across both translation directions (Bangla ↔ Sylheti), and find that these models struggle with dialect-specific vocabulary. To address this, we introduce Sylheti-CAP (Context-Aware Prompting), a three-step framework that embeds a linguistic rulebook, a dictionary (2,260 core vocabulary items and idioms), and an authenticity check directly into prompts. Extensive experiments show that Sylheti-CAP consistently improves translation quality across models and prompting strategies. Both automatic metrics and human evaluations confirm its effectiveness, while qualitative analysis reveals notable reductions in hallucinations, ambiguities, and awkward phrasing, establishing Sylheti-CAP as a scalable solution for dialectal and low-resource MT. Dataset link: https://github.com/TabiaTanzin/LLMs-for-Low-Resource-Dialect-Translation-Using-Context-Aware-Prompting-A-Case-Study-on-Sylheti.git
Problem

Research questions and friction points this paper is trying to address.

Investigates LLM translation for low-resource Sylheti dialect.
Addresses dialect-specific vocabulary challenges in LLM-based translation.
Proposes a context-aware prompting framework to improve translation quality.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a three-step context-aware prompting framework.
Embeds linguistic rules and a dictionary into prompts.
Reduces hallucinations and ambiguities in translations.
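The three-step framework above can be sketched as a prompt-assembly function. This is a minimal illustration, not the paper's implementation: the rulebook lines, dictionary glosses, and wording of the authenticity check are all hypothetical placeholders standing in for the paper's 2,260-entry dictionary and linguistic rulebook.

```python
# Hypothetical sketch of Sylheti-CAP-style prompt assembly.
# Rules and dictionary entries below are illustrative placeholders only.

RULEBOOK = [
    "Preserve Sylheti verb endings rather than standard Bangla ones.",
    "Prefer attested dialectal vocabulary over literal Bangla cognates.",
]

DICTIONARY = {
    "water": "fani",   # placeholder gloss, not from the paper's dictionary
    "house": "bari",   # placeholder gloss
}

def build_cap_prompt(source_text: str, direction: str = "Bangla -> Sylheti") -> str:
    """Embed (1) a linguistic rulebook, (2) dictionary guidance, and
    (3) an authenticity check directly into a single translation prompt."""
    rules = "\n".join(f"- {r}" for r in RULEBOOK)
    entries = "\n".join(f"- {src}: {tgt}" for src, tgt in DICTIONARY.items())
    return (
        f"You are translating {direction}.\n\n"
        f"Step 1. Follow these linguistic rules:\n{rules}\n\n"
        f"Step 2. Use these dictionary entries where relevant:\n{entries}\n\n"
        "Step 3. Authenticity check: before answering, verify every word is "
        "genuinely dialectal; if unsure, fall back to the dictionary form.\n\n"
        f"Translate: {source_text}"
    )

prompt = build_cap_prompt("The house is near the water.")
print(prompt)
```

The resulting string would be sent as the user message to any of the evaluated LLMs; keeping the three steps as separate, labeled sections is what makes the framework model-agnostic.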
Tabia Tanzin Prama
PhD Student in Computer Science
Data Mining · NLP · Health Informatics · AI Ethics
Christopher M. Danforth
Computational Story Lab, Vermont Complex Systems Institute, Vermont Advanced Computing Center, Department of Mathematics and Statistics, University of Vermont, Burlington, VT 05405, USA
Peter Sheridan Dodds
Professor/Director, Computational Story Lab, Vermont Complex Systems Institute, UVM
Language · Meaning · Stories · Sociotechnical Phenomena · Complex Systems