Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing research typically evaluates large language models’ source-text comprehension and translational creativity in literary translation as separate dimensions, and it lacks systematic metrics for creativity. The authors propose a paired-task framework that assesses both capabilities jointly on excerpts from 11 literary works, and they introduce “Units of Creative Potential” (UCPs), such as metaphors and wordplay, to quantify translational creativity. Combining expert annotations with UCP-based automated scoring, they evaluate 23 models and four creativity-oriented prompts. Results show that strong comprehension does not entail human-level creativity: most model-prompt combinations score at or near zero on creativity, and only Mistral-Large comes close to human-level performance (0.167 vs. 0.246). The gap is particularly pronounced for the English-Chinese language pair.

📝 Abstract

Large language models (LLMs) are increasingly used for creative tasks such as literary translation. Yet translational creativity remains underexplored and is rarely evaluated at scale, while source-text comprehension is typically studied in isolation, despite the fact that, in professional translation, comprehension and creativity are tightly intertwined. We address these gaps with a paired-task framework applied to literary excerpts from 11 books. Task 1 assesses source-text comprehension, and Task 2 evaluates translational creativity through Units of Creative Potential (UCPs), such as metaphors and wordplay. Using a scalable evaluation setup that combines expert human annotations with UCP-based automatic scoring, we benchmark 23 models and four creativity-oriented prompts. Our findings show that strong comprehension does not translate into human-level creativity: models often produce literal or contextually inappropriate renderings, with particularly large gaps for the more distant English-Chinese language pair. Creativity-oriented prompts yield only modest gains, and only one model, Mistral-Large, comes close to human-level creativity (0.167 vs. 0.246). Across all model-prompt combinations, only three exceed a creativity score of 0.1, while the rest remain at or near zero.

Problem

Research questions and friction points this paper is trying to address.

literary translation

large language models

comprehension

creativity

evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

paired-task framework

Units of Creative Potential

literary translation