When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Non-standard linguistic phenomena in user-generated content (UGC), including misspellings, slang, character repetition, and emojis, undermine the consistency and fairness of translation quality evaluation, hindering equitable assessment of models and metrics. Method: We derive a fine-grained taxonomy of twelve non-standard phenomena and five translation actions from the human translation guidelines of four UGC datasets; propose a "guideline-aware controllable evaluation" paradigm that explicitly aligns translation objectives with the guidelines under which reference translations were produced; and support this with guideline analysis, LLM prompt-sensitivity experiments, a qualitative case study, and cross-dataset validation. Contribution/Results: Translation scores for LLM outputs are highly sensitive to explicit prompt instructions; when prompts conform to dataset-specific guidelines, BLEU and COMET scores improve by up to 12.3%. This work motivates an evaluation framework anchored in translation guidelines, advancing standardized, interpretable, and guideline-grounded assessment of UGC translation.

📝 Abstract
User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset's guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.
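To make the taxonomy concrete, here is a minimal Python sketch of how the five translation actions could parameterise a guideline-aware prompt for an LLM. The phenomenon names and the guideline mapping are invented for illustration; the paper does not publish code, so treat this as a sketch of the paradigm, not its implementation.

```python
from enum import Enum

class TranslationAction(Enum):
    """The five translation actions from the paper's taxonomy."""
    NORMALISE = "normalise"  # rewrite in standard language
    COPY = "copy"            # reproduce the source form verbatim
    TRANSFER = "transfer"    # use an equivalent non-standard form in the target language
    OMIT = "omit"            # drop the phenomenon
    CENSOR = "censor"        # mask offensive content

# Hypothetical guideline: these phenomenon names are illustrative, not the
# paper's twelve categories, and the action choices are invented.
GUIDELINE = {
    "emojis": TranslationAction.COPY,
    "character repetition": TranslationAction.COPY,
    "slang": TranslationAction.TRANSFER,
    "spelling errors": TranslationAction.NORMALISE,
    "profanity": TranslationAction.CENSOR,
}

def build_prompt(source: str, target_lang: str,
                 guideline: dict[str, TranslationAction]) -> str:
    """Render a translation prompt with explicit per-phenomenon instructions."""
    rules = "\n".join(f"- {phen}: {act.value}" for phen, act in guideline.items())
    return (
        f"Translate the following text into {target_lang}.\n"
        f"Handle non-standard language as follows:\n{rules}\n\n"
        f"Text: {source}"
    )

print(build_prompt("omg this is soooo good 😂", "French", GUIDELINE))
```

Swapping the action for a phenomenon (say, mapping emojis to OMIT) yields a different prompt and, per the paper's case study, measurably different metric scores against the same references.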
Problem

Research questions and friction points this paper is trying to address.

Evaluating translation quality for non-standard user-generated content
Understanding how dataset guidelines prescribe the handling of slang, errors, and emojis in translation
Creating controllable evaluation frameworks aligned with translation guidelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Derived a taxonomy of twelve non-standard phenomena and five translation actions from dataset guidelines
Showed LLM translation scores improve with guideline-aligned prompts (see the sketch after this list)
Called for controllable, guideline-aware evaluation frameworks
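The prompt-sensitivity finding is easy to illustrate with a reference-based metric. The sketch below uses sacrebleu with invented French sentences (none are from the paper): when the reference was produced under a style-preserving guideline, the output of a guideline-aware prompt scores far higher than a normalised output, even though both are plausible translations. A COMET-based comparison would be set up analogously.

```python
import sacrebleu  # pip install sacrebleu

# Invented example: the reference follows a style-preserving guideline,
# so it copies (COPY action) both the emoji and the character repetition.
references = [["Oh mon dieu, c'est trooop bien 😂"]]

outputs = {
    "plain prompt": ["Oh mon dieu, c'est très bien."],                # normalised
    "guideline-aware prompt": ["Oh mon dieu, c'est trooop bien 😂"],  # style-preserving
}

for name, hyps in outputs.items():
    bleu = sacrebleu.corpus_bleu(hyps, references)
    print(f"{name}: BLEU = {bleu.score:.1f}")
```

The same mechanism explains why fair evaluation requires both models and metrics to be aware of translation guidelines: a metric that rewards the normalised output here would be testing adherence to a different guideline than the one the references follow.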