🤖 AI Summary
Problem: Verbal multiword expressions (VMWEs), including verbal idioms, verb-particle constructions (phrasal verbs), and light verb constructions, are difficult for machine translation (MT) because their meaning is often non-compositional, and they frequently cause errors in multilingual MT systems. Method: We propose an LLM-based rewriting approach that automatically replaces non-literal VMWEs with semantically equivalent literal paraphrases before translation. Contribution/Results: Evaluated on multilingual parallel corpora and benchmark VMWE datasets across English→German/French/Spanish/Chinese directions, the experiments show that VMWEs substantially degrade overall translation quality (average BLEU reduction of 2.1–4.7 points). Rewriting yields statistically significant improvements of up to +3.9 BLEU, particularly for verbal idioms and verb-particle constructions. These results support the “rewrite-then-translate” paradigm as a scalable, LLM-driven way to handle non-compositional expressions in translation.
📝 Abstract
Verbal multiword expressions (VMWEs) present significant challenges for natural language processing due to their complex and often non-compositional nature. While machine translation models have seen significant improvement with the advent of language models in recent years, accurately translating these complex linguistic structures remains an open problem. In this study, we analyze the impact of three VMWE categories -- verbal idioms, verb-particle constructions, and light verb constructions -- on machine translation quality from English to multiple languages. Using both established multiword expression datasets and sentences containing these language phenomena extracted from machine translation datasets, we evaluate how state-of-the-art translation systems handle these expressions. Our experimental results consistently show that VMWEs negatively affect translation quality. We also propose an LLM-based paraphrasing approach that replaces these expressions with their literal counterparts, demonstrating significant improvement in translation quality for verbal idioms and verb-particle constructions.
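The paraphrase-then-translate idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function names (`build_prompt`, `rewrite_then_translate`), and the toy stand-ins for the LLM and the MT system are all assumptions made for illustration. In practice the two callables would wrap a real LLM and a real translation system.

```python
from typing import Callable

# Hypothetical prompt wording; the prompt used in the paper is not given here.
PROMPT_TEMPLATE = (
    "Rewrite the sentence so that any verbal idiom or verb-particle "
    "construction is replaced by a literal paraphrase. Keep everything "
    "else unchanged.\nSentence: {sentence}\nRewritten:"
)


def build_prompt(sentence: str) -> str:
    """Fill the (assumed) paraphrasing prompt with the source sentence."""
    return PROMPT_TEMPLATE.format(sentence=sentence)


def rewrite_then_translate(
    sentence: str,
    paraphrase: Callable[[str], str],  # prompt -> LLM completion
    translate: Callable[[str], str],   # source text -> target text
) -> str:
    """Paraphrase non-literal expressions first, then translate the result."""
    literal = paraphrase(build_prompt(sentence)).strip()
    # Fall back to the original sentence if the LLM returns nothing usable.
    if not literal:
        literal = sentence
    return translate(literal)


# Toy stand-ins so the sketch runs without any external model.
def toy_llm(prompt: str) -> str:
    # Recover the sentence from the prompt and apply one hard-coded rewrite.
    sentence = prompt.split("Sentence: ", 1)[1].rsplit("\nRewritten:", 1)[0]
    return sentence.replace("kicked the bucket", "died")


def toy_mt(text: str) -> str:
    # Placeholder "translation" that only tags the text with a language code.
    return f"[de] {text}"


if __name__ == "__main__":
    print(rewrite_then_translate("He kicked the bucket.", toy_llm, toy_mt))
```

Passing the paraphraser and translator as callables keeps the pipeline independent of any particular model, so the same function can be used to compare translations with and without the rewriting step.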