🤖 AI Summary
This study addresses the robustness deficiencies of machine translation (MT) and quality estimation (QE) models when translating user-generated content (UGC) containing sentiment-bearing homophones in Chinese. We identify as a critical challenge the semantic and affective ambiguity that arises when words are phonetically identical yet carry divergent sentiment polarity. To this end, we propose the first information-theoretic, self-information–guided method for automatically constructing sentiment-oriented homophone pairs. Our approach integrates homophone mining, expert validation, and multi-model probing to systematically assess MT systems and mainstream QE models—including multitask models, fine-tuned multilingual models, and large language models (LLMs)—on inputs that are phonetically identical yet sentimentally distinct. We release an open-source homophone-perturbed dataset and a human-annotated translation corpus. Experiments demonstrate that our method achieves significantly higher correlation with human judgments than baseline approaches, and further confirm a positive correlation between LLM scale and robustness in sentiment-preserving translation.
📝 Abstract
Evaluating machine translation (MT) of user-generated content (UGC) involves unique challenges, such as checking whether the emotional nuances of the source are preserved in the target text. Recent studies have proposed emotion-related datasets, frameworks, and models to automatically evaluate the MT quality of Chinese UGC without relying on reference translations. However, whether these models are robust to the challenge of preserving emotional nuances has been left largely unexplored. To address this gap, we introduce a novel method inspired by information theory that generates challenging emotion-related Chinese homophone words by leveraging the concept of self-information. Our approach generates homophones observed to cause emotion-preservation errors in translation, exposing vulnerabilities in MT systems and their evaluation methods when tackling emotional UGC. We evaluate the efficacy of our method through human evaluation of the quality of the generated homophones, and compare it with an existing approach, showing that our method achieves higher correlation with human judgments. The generated Chinese homophones, along with their manual translations, are used to generate perturbations and to probe the robustness of existing quality evaluation models, including models trained with multi-task learning, fine-tuned variants of multilingual language models, and large language models (LLMs). Our results indicate that larger LLMs exhibit higher stability and robustness to such perturbations. We release our data and code for reproducibility and further research.
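To make the self-information idea concrete, the sketch below scores candidate homophone substitutes by their surprisal, I(w) = −log₂ p(w). This is an illustrative toy, not the paper's implementation: the word list, frequencies, and the `rank_homophone_substitutes` helper are all hypothetical, and the probabilities would in practice come from a large Chinese corpus or a language model rather than a hand-written dictionary.

```python
import math

# Hypothetical toy frequencies; real estimates would come from a large
# Chinese corpus or an LM. 彩 (cǎi) and 菜 (cài) share the pinyin "cai"
# (tones ignored here for illustration).
word_freq = {"在": 900, "再": 80, "彩": 15, "菜": 5}
TOTAL = sum(word_freq.values())

def self_information(word: str) -> float:
    """I(w) = -log2 p(w): rarer words carry more self-information."""
    p = word_freq[word] / TOTAL
    return -math.log2(p)

def rank_homophone_substitutes(candidates: list[str]) -> list[str]:
    """Rank same-pinyin candidates by self-information, highest first,
    on the assumption that high-surprisal substitutions are the ones
    most likely to perturb MT and QE systems."""
    return sorted(candidates, key=self_information, reverse=True)

print(rank_homophone_substitutes(["彩", "菜"]))  # rarest candidate first
```

Under this toy distribution, the rarer character 菜 outranks 彩 as a perturbation candidate; the paper's method additionally filters candidates through expert validation before using them as perturbations.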