When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing hallucination detection benchmarks are limited to English, support only sequence-level evaluation, and lack fine-grained multilingual supervision. To address these limitations, we introduce PsiloQA, the first large-scale, multilingual, span-level hallucination detection dataset, covering 14 languages and enabling cross-lingual, token- or span-level factual assessment. PsiloQA is constructed via a GPT-4o-driven, three-stage automated pipeline that drastically reduces annotation cost while ensuring high scalability and strong cross-lingual generalization. Experimental results show that fine-tuned encoder models achieve the strongest performance on multilingual hallucination detection. Moreover, PsiloQA supports effective knowledge transfer to other benchmarks (e.g., FEVER, Factool), enhancing their multilingual factuality evaluation. This work establishes a new benchmark and methodology for scalable, fine-grained, multilingual hallucination detection.

📝 Abstract
Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
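The three-stage pipeline described in the abstract can be sketched as plain functions. This is a minimal structural sketch only: the `llm_*` helpers below are hypothetical stand-ins for the paper's actual GPT-4o prompts and answer-generating LLMs, which are not shown on this page, and their return values are placeholder data.

```python
# Hedged sketch of a PsiloQA-style three-stage pipeline.
# All llm_* helpers are hypothetical stubs, not the paper's real prompts.

def llm_generate_qa(passage: str) -> tuple[str, str]:
    """Stage 1 stub: would prompt GPT-4o to write a QA pair
    (question + golden answer) from a Wikipedia passage."""
    return ("Placeholder question about the passage?", "golden answer")

def llm_answer_no_context(question: str) -> str:
    """Stage 2 stub: would query a diverse LLM *without* the passage,
    inviting potentially hallucinated answers."""
    return "a possibly hallucinated answer"

def llm_annotate_spans(answer: str, golden: str, context: str) -> list[tuple[int, int]]:
    """Stage 3 stub: would ask GPT-4o to mark hallucinated character
    spans by comparing against the golden answer and retrieved context."""
    return [(2, 22)]

def build_example(passage: str) -> dict:
    """Run the three stages and assemble one span-annotated example."""
    question, golden = llm_generate_qa(passage)
    answer = llm_answer_no_context(question)
    spans = llm_annotate_spans(answer, golden, passage)
    return {"question": question, "answer": answer,
            "golden": golden, "hallucinated_spans": spans}
```

With real model calls substituted in, running `build_example` over multilingual Wikipedia passages would yield the kind of span-annotated records the dataset contains.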
Problem

Research questions and friction points this paper is trying to address.

Detecting span-level hallucinations in multilingual LLM outputs
Addressing limitations of English-only sequence-level hallucination benchmarks
Developing cost-effective automated annotation for fine-grained hallucination detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated span-level hallucination annotation pipeline
Multilingual dataset covering 14 diverse languages
Encoder models achieve strongest cross-lingual performance
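Training an encoder on span-level annotations typically means converting character-level hallucinated spans into per-token labels. The helper below is an illustrative sketch of one common scheme (BIO tagging over tokenizer offsets); the paper's exact label format is not specified on this page, and the `B-HAL`/`I-HAL` label names are assumptions.

```python
def spans_to_bio(offsets: list[tuple[int, int]],
                 spans: list[tuple[int, int]]) -> list[str]:
    """Map character-level hallucinated spans to token-level BIO labels.

    offsets: (start, end) character offsets of each token, as produced
             by a fast tokenizer's offset mapping.
    spans:   (start, end) character spans marked as hallucinated.
    """
    labels = ["O"] * len(offsets)
    for span_start, span_end in spans:
        first = True  # first overlapping token gets B-, the rest I-
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start < span_end and tok_end > span_start:  # overlap test
                labels[i] = "B-HAL" if first else "I-HAL"
                first = False
    return labels

# Example: tokens "Paris" "is" "the" "capital", first token hallucinated.
print(spans_to_bio([(0, 5), (6, 8), (9, 12), (13, 20)], [(0, 5)]))
# → ['B-HAL', 'O', 'O', 'O']
```

These per-token labels can then feed a standard token-classification head on any multilingual encoder, which is consistent with the finding that fine-tuned encoders perform strongest across languages.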
👥 Authors
Elisei Rykov
Skoltech
Kseniia Petrushina
Skoltech, Moscow Institute of Physics and Technology
Maksim Savkin
AIRI, Moscow Institute of Physics and Technology
Valerii Olisov
Moscow Institute of Physics and Technology
Artem Vazhentsev
Independent Researcher
Interests: deep learning, NLP, uncertainty estimation
Kseniia Titova
MWS AI, Skoltech
Alexander Panchenko
Associate Professor for Natural Language Processing
Interests: natural language processing, word sense disambiguation, text style transfer, argument mining, graph
Vasily Konovalov
Unknown affiliation
Interests: Natural Language Processing, Machine Learning, Dialogue Systems
Julia Belikova
Sber AI Lab, Skoltech