🤖 AI Summary
This study systematically investigates how human-plausible errors (e.g., spelling and grammatical deviations) and synthetic noise (character-level and composite) in user prompts affect large language models' (LLMs) performance on machine translation (MT) and MT evaluation. Using controlled noise injection, combined with quantitative evaluation and qualitative analysis, the authors find that prompt quality critically shapes model behavior: low-quality prompts primarily impair instruction following rather than the intrinsic quality of the translations, with composite and character-level noise proving most detrimental. A key finding concerns prompt robustness: even when prompts are distorted to the point of human unreadability, LLMs retain basic translation capability. These results provide empirical grounding for prompt engineering, robustness assessment, and human-AI collaborative translation practice.
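The character-level noise injection described above can be sketched as follows. This is a hypothetical illustration only: the function name, the set of edit operations (delete, substitute, duplicate), and the per-character noise rate are assumptions for demonstration, not the paper's actual noiser implementation.

```python
import random
import string

def char_noise(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Inject character-level noise: with probability `rate`, delete,
    substitute, or duplicate each alphabetic character.

    Hypothetical sketch of a character-level noiser; the operations and
    rate are illustrative assumptions, not the study's implementation.
    """
    rng = random.Random(seed)  # fixed seed for reproducible perturbations
    out = []
    for c in text:
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "substitute", "duplicate"])
            if op == "delete":
                continue  # drop the character entirely
            elif op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
            else:  # duplicate
                out.append(c)
                out.append(c)
        else:
            out.append(c)  # leave the character untouched
    return "".join(out)

# Example: progressively corrupt a translation prompt.
prompt = "Translate the following sentence into German."
for rate in (0.0, 0.2, 0.5):
    print(f"rate={rate}: {char_noise(prompt, rate=rate)}")
```

A composite noiser could then be built by chaining several such functions (e.g., character-level plus word-level perturbations), which matches the study's observation that combined noise degrades instruction following the most.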
📝 Abstract
Large language models (LLMs) have achieved top results in recent machine translation evaluations, but they are also known to be sensitive to errors and perturbations in their prompts. We systematically evaluate how both humanly plausible and synthetic errors in user prompts affect LLMs' performance on two related tasks: machine translation and machine translation evaluation. We provide both a quantitative analysis and qualitative insights into how the models respond to increasing noise in the user prompt.
Prompt quality strongly affects translation performance: with many errors, even a good prompt can underperform a minimal or poor prompt without errors. However, different noise types impact translation quality differently, with character-level and combined noisers degrading performance more than phrasal perturbations. Qualitative analysis reveals that lower prompt quality largely leads to poorer instruction following rather than directly degrading translation quality itself. Furthermore, LLMs can still translate under overwhelming random noise that would render the prompt illegible to humans.