MUCH: A Multilingual Claim Hallucination Benchmark

📅 2025-11-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) often produce unreliable factual statements, yet existing uncertainty quantification (UQ) benchmarks lack multilingual support and fine-grained, claim-level evaluation. Method: We introduce MUCH, the first multilingual claim-level UQ benchmark, covering English, French, Spanish, and German across four instruction-tuned open-weight LLMs. MUCH uniquely releases per-token generation logits to enable white-box UQ method development, and proposes a lightweight, deterministic claim segmentation algorithm that incurs only 0.2% of the LLM generation time, enabling efficient real-time monitoring. The benchmark is built via instruction-tuned model generation, multilingual human annotation, and systematic storage of logit trajectories. Contribution/Results: Experiments reveal substantial room for improvement in both the accuracy and efficiency of current UQ methods. MUCH establishes a standardized, reproducible, and deployment-ready evaluation framework and dataset for reliability research in multilingual LLMs.

📝 Abstract
Claim-level Uncertainty Quantification (UQ) is a promising approach to mitigate the lack of reliability in Large Language Models (LLMs). We introduce MUCH, the first claim-level UQ benchmark designed for fair and reproducible evaluation of future methods under realistic conditions. It includes 4,873 samples across four European languages (English, French, Spanish, and German) and four instruction-tuned open-weight LLMs. Unlike prior claim-level benchmarks, we release 24 generation logits per token, facilitating the development of future white-box methods without re-generating data. Moreover, in contrast to previous benchmarks that rely on manual or LLM-based segmentation, we propose a new deterministic algorithm capable of segmenting claims using as little as 0.2% of the LLM generation time. This makes our segmentation approach suitable for real-time monitoring of LLM outputs, ensuring that MUCH evaluates UQ methods under realistic deployment constraints. Finally, our evaluations show that current methods still have substantial room for improvement in both performance and efficiency.
Problem

Research questions and friction points this paper is trying to address.

Mitigating unreliable claims in multilingual LLMs through uncertainty quantification
Developing efficient claim segmentation for real-time LLM output monitoring
Benchmarking UQ methods under realistic multilingual deployment constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual benchmark for claim-level uncertainty quantification
Deterministic algorithm for real-time claim segmentation
Released token logits enabling white-box method development
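The paper does not detail its segmentation algorithm here, but the appeal of a deterministic approach can be illustrated with a minimal sketch: a rule-based splitter over punctuation boundaries runs in microseconds, whereas LLM-based segmentation requires an extra model call per output. Everything below (the `segment_claims` function and its boundary rule) is a hypothetical illustration, not the method proposed in MUCH.

```python
import re

# Hypothetical sketch of a deterministic, rule-based claim segmenter.
# This is NOT the algorithm from the MUCH paper; it only illustrates why
# deterministic segmentation costs a negligible fraction of generation time.

# Split after sentence-final punctuation followed by whitespace.
# The lookbehind keeps decimals like "2.1" intact (no whitespace after the dot).
_BOUNDARY = re.compile(r"(?<=[.!?;])\s+")

def segment_claims(text: str) -> list[str]:
    """Split generated text into candidate claim spans at punctuation boundaries."""
    return [span.strip() for span in _BOUNDARY.split(text) if span.strip()]

claims = segment_claims(
    "Paris is in France. It has 2.1 million residents; it is the capital."
)
# Three candidate claims, each of which could then be scored by a UQ method.
```

A real segmenter would need to handle abbreviations, list items, and multi-claim sentences, but even a much richer rule set stays in the same cost regime: pure string processing, no extra forward passes.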
Jérémie Dentan
LIX (École Polytechnique, IP Paris, CNRS)
Alexi Canesse
LIX (École Polytechnique, IP Paris, CNRS)
Davide Buscaldi
Associate Professor (Maître de conférences, HDR), LIPN, Université Sorbonne Paris Nord
LLMs, Information Retrieval, Ontology Learning, Geographic IR, Text Mining
Aymen Shabou
Crédit Agricole SA
Sonia Vanier
LIX (École Polytechnique, IP Paris, CNRS)