🤖 AI Summary
Existing multilingual machine translation benchmarks inadequately detect translation hallucinations in large language models (LLMs). To address this, we introduce HalloMTBench, a hallucination-diagnostic multilingual benchmark covering 11 English-to-X translation directions with 5,435 human-verified, high-quality instances. Methodologically, we decouple hallucinations into two orthogonal categories, *Instruction Detachment* and *Source Detachment*, and identify distinct triggering mechanisms, including model-scale effects, source-length sensitivity, linguistic bias, and reinforcement-learning-amplified language mixing. Candidate translations are generated by frontier LLMs and rigorously validated via an ensemble of LLM judges augmented with expert annotation to ensure high fidelity. Systematic evaluation of 17 mainstream LLMs reveals cross-lingual hallucination patterns and their underlying causes. HalloMTBench establishes a reproducible, scalable, and forward-looking diagnostic platform for advancing research on reliability in multilingual machine translation.
📝 Abstract
Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing these failures in multilingual LLMs. To diagnose hallucinations in multilingual LLMs, we introduce a framework with a taxonomy that separates *Instruction Detachment* from *Source Detachment*. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark spanning 11 English-to-X directions. We employ four frontier LLMs to generate candidate translations and scrutinize them with an ensemble of LLM judges and expert validation, curating 5,435 high-quality instances. We evaluate 17 LLMs on HalloMTBench. Results reveal distinct "hallucination triggers": failure patterns tied to model scale, source-length sensitivity, linguistic bias, and Reinforcement Learning (RL)-amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures and is available at https://huggingface.co/collections/AIDC-AI/marco-mt.