AI Summary
This study addresses the unclear capability of large language models (LLMs) in translating culturally sensitive content, despite their strong performance in general-purpose translation. To this end, the authors construct CanMT, the first culture-aware, novel-driven parallel corpus, and propose an integrated evaluation framework combining multidimensional human assessment with LLM-as-a-judge methodologies. The work systematically evaluates mainstream LLMs under diverse translation strategies, revealing a significant gap between models' cultural knowledge recognition and their practical translation performance. Findings demonstrate that both model architecture and translation strategy systematically influence output quality, and that reference translations substantially enhance the accuracy of automatic evaluation metrics. This research establishes a benchmark dataset, theoretical foundation, and reliable evaluation paradigm for culture-aware machine translation.
Abstract
Large language models (LLMs) have achieved strong performance in general machine translation, yet their ability in culture-aware scenarios remains poorly understood. To bridge this gap, we introduce CanMT, a Culture-Aware Novel-Driven Parallel Dataset for Machine Translation, together with a theoretically grounded, multi-dimensional evaluation framework for assessing cultural translation quality. Leveraging CanMT, we systematically evaluate a wide range of LLMs and translation systems under different translation strategy constraints. Our findings reveal substantial performance disparities across models and demonstrate that translation strategies exert a systematic influence on model behavior. Further analysis shows that translation difficulty varies across types of culture-specific items, and that a persistent gap remains between models' recognition of culture-specific knowledge and their ability to correctly operationalize it in translation outputs. In addition, incorporating reference translations is shown to substantially improve evaluation reliability in LLM-as-a-judge, underscoring their essential role in assessing culture-aware translation quality. The corpus and code are available at CanMT.