MUCH: A Multilingual Claim Hallucination Benchmark

📅 2025-11-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) often produce unreliable factual statements, yet existing uncertainty quantification (UQ) benchmarks lack multilingual support and fine-grained, claim-level evaluation. Method: We introduce MUCH, the first multilingual claim-level UQ benchmark, covering English, French, Spanish, and German across four instruction-tuned open-weight LLMs. MUCH uniquely releases per-token generation logits to enable white-box UQ method development, and proposes a lightweight, deterministic claim segmentation algorithm that incurs only 0.2% of the LLM generation time, enabling efficient real-time monitoring. The benchmark is built via instruction-tuned model generation, multilingual human annotation, and systematic storage of logit trajectories. Contribution/Results: Experiments reveal substantial room for improvement in both the accuracy and efficiency of current UQ methods. MUCH establishes a standardized, reproducible, and deployment-ready evaluation framework and dataset for reliability research in multilingual LLMs.

📝 Abstract
Claim-level Uncertainty Quantification (UQ) is a promising approach to mitigate the lack of reliability in Large Language Models (LLMs). We introduce MUCH, the first claim-level UQ benchmark designed for fair and reproducible evaluation of future methods under realistic conditions. It includes 4,873 samples across four European languages (English, French, Spanish, and German) and four instruction-tuned open-weight LLMs. Unlike prior claim-level benchmarks, we release 24 generation logits per token, facilitating the development of future white-box methods without re-generating data. Moreover, in contrast to previous benchmarks that rely on manual or LLM-based segmentation, we propose a new deterministic algorithm capable of segmenting claims using as little as 0.2% of the LLM generation time. This makes our segmentation approach suitable for real-time monitoring of LLM outputs, ensuring that MUCH evaluates UQ methods under realistic deployment constraints. Finally, our evaluations show that current methods still have substantial room for improvement in both performance and efficiency.
Problem

Research questions and friction points this paper is trying to address.

Mitigating unreliable claims in multilingual LLMs through uncertainty quantification
Developing efficient claim segmentation for real-time LLM output monitoring
Benchmarking UQ methods under realistic multilingual deployment constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual benchmark for claim-level uncertainty quantification
Deterministic algorithm for real-time claim segmentation
Released token logits enabling white-box method development
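The paper does not detail its segmentation algorithm here, but the appeal of a deterministic approach can be illustrated with a minimal sketch: a rule-based splitter over punctuation boundaries runs in microseconds, whereas LLM-based segmentation requires an extra model call per output. Everything below (the `segment_claims` function and its boundary rule) is a hypothetical illustration, not the method proposed in MUCH.

```python
import re

# Hypothetical sketch of a deterministic, rule-based claim segmenter.
# This is NOT the algorithm from the MUCH paper; it only illustrates why
# deterministic segmentation costs a negligible fraction of generation time.

# Split after sentence-final punctuation followed by whitespace.
# The lookbehind keeps decimals like "2.1" intact (no whitespace after the dot).
_BOUNDARY = re.compile(r"(?<=[.!?;])\s+")

def segment_claims(text: str) -> list[str]:
    """Split generated text into candidate claim spans at punctuation boundaries."""
    return [span.strip() for span in _BOUNDARY.split(text) if span.strip()]

claims = segment_claims(
    "Paris is in France. It has 2.1 million residents; it is the capital."
)
# Three candidate claims, each of which could then be scored by a UQ method.
```

A real segmenter would need to handle abbreviations, list items, and multi-claim sentences, but even a much richer rule set stays in the same cost regime: pure string processing, no extra forward passes.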
Jérémie Dentan
LIX (École Polytechnique, IP Paris, CNRS)
Alexi Canesse
LIX (École Polytechnique, IP Paris, CNRS)
Davide Buscaldi
Associate Professor (Maître de conférences, HDR), LIPN, Université Sorbonne Paris Nord
LLMs, Information Retrieval, Ontology Learning, Geographic IR, Text Mining
Aymen Shabou
Crédit Agricole SA
Sonia Vanier
LIX (École Polytechnique, IP Paris, CNRS)