🤖 AI Summary
Existing multilingual machine translation benchmarks inadequately detect translation hallucinations in large language models (LLMs). To address this, we introduce HalloMTBench, a hallucination-diagnostic multilingual benchmark covering 11 English-to-X translation directions with 5,435 human-verified, high-quality instances. Methodologically, we decouple hallucinations into two orthogonal categories, *Instruction Detachment* and *Source Detachment*, and identify distinct triggering mechanisms, including model-scale effects, source-length sensitivity, linguistic bias, and reinforcement-learning-amplified language mixing. Candidate translations are generated by frontier LLMs and rigorously validated via an ensemble of LLM judges augmented with expert annotation to ensure high fidelity. Systematic evaluation of 17 mainstream LLMs reveals cross-lingual hallucination patterns and their underlying causes. HalloMTBench establishes a reproducible, scalable, and forward-looking diagnostic platform for advancing research on reliability in multilingual machine translation.
📝 Abstract
Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing these failures in multilingual LLMs. To diagnose hallucinations in multilingual LLMs, we introduce a framework with a taxonomy that separates *Instruction Detachment* from *Source Detachment*. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark spanning 11 English-to-X directions. We employ four frontier LLMs to generate candidate translations and scrutinize them with an ensemble of LLM judges and expert validation, curating 5,435 high-quality instances. We evaluate 17 LLMs on HalloMTBench. Results reveal distinct "hallucination triggers": failure patterns tied to model scale, source-length sensitivity, linguistic bias, and Reinforcement Learning (RL)-amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures and is available at https://huggingface.co/collections/AIDC-AI/marco-mt.