🤖 AI Summary
Existing automatic evaluation metrics and large language models used as judges (LLM-as-a-judge) struggle to assess creativity in literary translation accurately, and their scores diverge substantially from professional human evaluations. This study constructs a literary translation dataset spanning three modalities (human translation, machine translation, and post-editing), three genres, and three language pairs, annotated by professional literary translators for fine-grained creative features, and systematically evaluates both traditional automatic metrics and LLMs against these annotations. The work shows that LLMs display a systematic preference for machine-translated output and frequently misjudge culturally appropriate creative expressions, particularly in highly literary texts such as poetry, where agreement between current evaluation methods and human judgments declines markedly. These findings underscore the need for new evaluation tools that can accommodate unconventional yet valid literary translations.
📝 Abstract
This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation creativity (creative shifts and errors), and whether they can substitute for laborious manual annotation. A dataset of literary translations covering three modalities (human translation, machine translation, and post-editing), three genres, and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations of creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need for new tools that do not so often treat non-routine translations as errors.