Proverbs Run in Pairs: Evaluating Proverb Translation Capability of Large Language Model

📅 2025-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of evaluating machine translation quality for culture-loaded proverbs, where existing automatic metrics (e.g., BLEU, chrF++, COMET) are inadequate for assessing cultural adaptation. It introduces ProverbMT, the first open-source, multilingual proverb translation benchmark, covering four language pairs and two evaluation scenarios: isolated proverb translation and context-embedded translation. The authors systematically evaluate state-of-the-art neural machine translation (NMT) models (M2M-100, NLLB) and large language models (LLMs) (Llama, Qwen, GLM) under zero-shot and fine-tuned settings. Results show that LLMs consistently outperform NMT models by 12–28% in accuracy, that translation quality is higher for culturally proximate language pairs, and that conventional metrics correlate poorly (<0.3) with human judgments. The study argues for culture-aware evaluation and proposes a reproducible, fine-grained evaluation paradigm specifically for proverb translation.
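
To make the comparison between automatic metrics and human judgments concrete, the following is a minimal sketch of how such a correlation could be computed. It assumes Spearman rank correlation, which is a common choice for metric meta-evaluation; the summary does not state which coefficient the authors used. The function names and the sample scores are hypothetical placeholders for illustration only, not data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


if __name__ == "__main__":
    # Hypothetical per-sentence scores, invented purely for illustration.
    metric_scores = [0.62, 0.15, 0.48, 0.30, 0.71]  # e.g. an automatic metric
    human_scores = [2, 4, 5, 3, 3]                  # e.g. 1-5 adequacy ratings
    print(f"Spearman rho = {spearman(metric_scores, human_scores):.3f}")
```

A coefficient near zero on real data would indicate that the metric's ranking of translations says little about how human raters rank them, which is the kind of mismatch the summary reports for proverbs.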

📝 Abstract
Despite the remarkable performance of machine translation (MT), the translation of cultural elements in language, such as idioms, proverbs, and colloquial expressions, remains underexplored. This paper investigates the capability of state-of-the-art neural machine translation (NMT) models and large language models (LLMs) in translating proverbs, which are deeply rooted in cultural contexts. We construct a translation dataset of standalone proverbs and proverbs in conversation for four language pairs. Our experiments show that the studied models can achieve good translation between languages with similar cultural backgrounds, and LLMs generally outperform NMT models in proverb translation. Furthermore, we find that current automatic evaluation metrics such as BLEU, chrF++, and COMET are inadequate for reliably assessing the quality of proverb translation, highlighting the need for more culturally aware evaluation metrics.
Problem

Research questions and friction points this paper is trying to address.

Neural Machine Translation
Cultural Elements
Translation Quality Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Culture-aware Translation
Large Language Models
Evaluation Metrics for Idiomatic Expressions
Minghan Wang
Department of Data Science & AI, Monash University
Viet Pham
Department of Data Science & AI, Monash University
Farhad Moghimifar
Applied Scientist @ Oracle
Natural Language Processing, Machine Learning, Causality
Thuy-Trang Vu
Monash University
Natural Language Processing, Machine Learning