Evaluating Large Language Models on Urdu Idiom Translation

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Urdu idiom translation remains challenging due to low-resource constraints and intricate cultural-semantic dependencies that hinder machine translation performance. This work introduces the first dual-script (native Urdu and Roman Urdu) Urdu-to-English idiom translation evaluation dataset, enabling systematic assessment of large language models (LLMs) and neural machine translation (NMT) systems on cultural-semantic fidelity. The authors propose a multi-dimensional automatic evaluation framework integrating BLEU, BERTScore, COMET, and XCOMET, combined with comparative prompt-engineering strategies. Results show that native-script input yields significantly higher translation quality than Romanized input, and that while prompt engineering consistently improves idiomatic accuracy, gains across different prompt templates are marginal. The study is the first to empirically demonstrate the impact of textual representation on idiom translation quality, establishing a benchmark dataset and methodological foundation for culturally adaptive translation in low-resource languages.

📝 Abstract
Idiomatic translation remains a significant challenge in machine translation, especially for low-resource languages such as Urdu, and has received limited prior attention. To advance research in this area, we introduce the first evaluation datasets for Urdu-to-English idiomatic translation, covering both Native Urdu and Roman Urdu scripts and annotated with gold-standard English equivalents. We evaluate multiple open-source Large Language Models (LLMs) and Neural Machine Translation (NMT) systems on this task, focusing on their ability to preserve idiomatic and cultural meaning. Automatic metrics including BLEU, BERTScore, COMET, and XCOMET are used to assess translation quality. Our findings indicate that prompt engineering enhances idiomatic translation compared to direct translation, though performance differences among prompt types are relatively minor. Moreover, cross-script comparisons reveal that text representation substantially affects translation quality, with Native Urdu inputs producing more accurate idiomatic translations than Roman Urdu.
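To illustrate why surface-overlap metrics like BLEU penalize literal renderings of idioms against an idiomatic gold reference, here is a minimal pure-Python sketch of a smoothed sentence-level BLEU. The `sentence_bleu` helper and the example sentences are illustrative assumptions, not the paper's implementation (the paper uses standard toolkits for BLEU, BERTScore, COMET, and XCOMET):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    # Multiset of n-grams in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified smoothed BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        # Add-one smoothing so one empty n-gram order does not zero the score
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

# Hypothetical gold reference for an Urdu idiom, rendered with its
# English idiomatic equivalent (sentences invented for illustration)
gold = "he added fuel to the fire"
idiomatic = "he added fuel to the fire"        # idiomatic rendering
literal = "he poured oil on the burning fire"  # word-for-word rendering

print(f"idiomatic BLEU: {sentence_bleu(idiomatic, gold):.3f}")
print(f"literal BLEU:   {sentence_bleu(literal, gold):.3f}")
```

The literal rendering shares only a few unigrams with the gold reference and almost no higher-order n-grams, so it scores far lower than the idiomatic one; this is precisely why the paper pairs BLEU with semantic metrics (BERTScore, COMET, XCOMET) that are more tolerant of meaning-preserving paraphrase.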
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on Urdu idiom translation challenges
Assessing translation quality across Native and Roman Urdu scripts
Analyzing prompt engineering impact on idiomatic meaning preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created first Urdu idiom evaluation datasets
Evaluated LLMs using multiple automatic metrics
Used prompt engineering to enhance translation quality