Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing machine translation evaluation metrics are insufficiently accurate in non-literal translation scenarios, such as social media and literary texts, making it difficult to reliably assess translation quality. To address this limitation, this work introduces MENT, the first meta-evaluation benchmark dataset designed specifically for non-literal translation, and proposes RATE, a reflective agent framework powered by large language models (LLMs). RATE dynamically orchestrates specialized sub-agents to perform fine-grained, context-aware evaluation, mitigating key limitations of conventional metrics and LLM-as-a-Judge approaches, particularly knowledge cutoff and scoring inconsistency. On the MENT benchmark, RATE improves over existing methods by at least 3.2 meta-score points while remaining robust on general-purpose translation evaluation tasks.

📝 Abstract
Large Language Models (LLMs) have significantly advanced Machine Translation (MT) and are increasingly applied to linguistically complex domains such as Social Network Services and literature. In these scenarios, translations often require handling non-literal expressions, which undermines the accuracy of MT metrics. To systematically investigate the reliability of MT metrics, we first curate MENT, a meta-evaluation dataset focused on non-literal translations. MENT encompasses four non-literal translation domains and features source sentences paired with translations from diverse MT systems, with 7,530 human-annotated scores on translation quality. Experimental results reveal the inaccuracies of traditional MT metrics and the limitations of LLM-as-a-Judge, particularly the knowledge cutoff and score inconsistency problems. To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework centered on a reflective Core Agent that dynamically invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, which improves on current metrics by at least 3.2 meta-score points. Further experiments demonstrate the robustness of RATE on general-domain MT evaluation. Code and dataset are available at: https://github.com/BITHLP/RATE.
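The abstract describes RATE only at a high level: a reflective Core Agent that dynamically invokes specialized sub-agents and checks their outputs for consistency. Below is a minimal sketch of what such an orchestration loop could look like. The sub-agent names (`LiteralnessAgent`, `CulturalContextAgent`), the `Assessment` interface, and the score-spread reflection criterion are all illustrative assumptions, not the paper's actual implementation; for that, see the linked repository.

```python
# Hypothetical sketch of a reflective core-agent evaluation loop in the
# spirit of RATE. All class names, heuristics, and the stopping criterion
# are illustrative assumptions; the real system is in the authors' repo.
from dataclasses import dataclass, field


@dataclass
class Assessment:
    score: float   # translation quality score on a 0-100 scale
    rationale: str # natural-language justification


class SubAgent:
    """A specialized evaluator focusing on one aspect of the translation."""
    name = "generic"

    def evaluate(self, source: str, translation: str) -> Assessment:
        raise NotImplementedError


class LiteralnessAgent(SubAgent):
    name = "literalness"

    def evaluate(self, source: str, translation: str) -> Assessment:
        # Placeholder heuristic: a real agent would prompt an LLM here.
        return Assessment(80.0, "Meaning preserved beyond word-level mapping.")


class CulturalContextAgent(SubAgent):
    name = "cultural-context"

    def evaluate(self, source: str, translation: str) -> Assessment:
        # Placeholder: could call a retrieval tool to bypass knowledge cutoff.
        return Assessment(70.0, "Slang rendered with a slightly dated equivalent.")


@dataclass
class CoreAgent:
    """Reflective orchestrator: runs sub-agents, re-queries on disagreement."""
    sub_agents: list[SubAgent] = field(default_factory=list)
    max_rounds: int = 3
    consistency_threshold: float = 15.0  # max allowed score spread (assumption)

    def score(self, source: str, translation: str) -> Assessment:
        history: list[Assessment] = []
        for _ in range(self.max_rounds):
            round_results = [a.evaluate(source, translation) for a in self.sub_agents]
            history.extend(round_results)
            scores = [r.score for r in round_results]
            # Reflection step: if sub-agents agree closely enough, stop;
            # otherwise iterate (a real system would revise prompts or
            # invoke additional tools before the next round).
            if max(scores) - min(scores) <= self.consistency_threshold:
                break
        final = sum(r.score for r in history) / len(history)
        rationale = " | ".join(r.rationale for r in history[-len(self.sub_agents):])
        return Assessment(final, rationale)


if __name__ == "__main__":
    core = CoreAgent(sub_agents=[LiteralnessAgent(), CulturalContextAgent()])
    result = core.score("Ce film est un navet.", "This movie is a total turkey.")
    print(f"{result.score:.1f}: {result.rationale}")
```

Averaging sub-agent scores and gating on their spread is one simple way to operationalize the "score inconsistency" check the abstract mentions; the paper's actual aggregation and reflection logic may differ.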
Problem

Research questions and friction points this paper is trying to address.

non-literal translation
machine translation evaluation
translation metrics
LLM-as-a-Judge
meta-evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

non-literal translation
translation evaluation
large language models
agentic framework
meta-evaluation dataset
Yanzhi Tian
Beijing Institute of Technology
Machine Translation, Large Language Models, Vision Language Models
Cunxiang Wang
Tsinghua University; Zhipu AI
Large Language Models, LLM Evaluation, LLM Post-training
Zeming Liu
School of Computer Science and Engineering, Beihang University
Heyan Huang
School of Computer Science and Technology, Beijing Institute of Technology
Wenbo Yu
Zhipu AI
Dawei Song
School of Computer Science and Technology, Beijing Institute of Technology
Jie Tang
UW Madison
Computed Tomography
Yuhang Guo
Beijing Institute of Technology
Natural Language Processing