MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing

📅 2026-05-24
📈 Citations: 0 (influential: 0)
🤖 AI Summary
This work addresses hallucination detection in large language models (LLMs) for non-English and low-resource languages, where existing methods based on output confidence or single-layer representations often miss factual inconsistencies. The authors propose MultiHaluDet, a three-stage stacking framework that requires no language-specific fine-tuning. It probes the hidden state trajectories of a frozen LLM across its layers, processes them with multi-scale attention and self-attention pooling, and feeds the resulting out-of-fold embeddings into a calibrated ensemble of classical classifiers to capture both fine-grained and coarse-grained hallucination patterns. On the English HaluEval and TriviaQA benchmarks, using Mistral-7B and LLaMA2-7B, the approach reaches up to 98.55% AUROC. It is also reported to outperform baselines on French (high-resource), Bangla (medium-resource), and Amharic (low-resource), indicating cross-lingual generalization across resource tiers.
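As a rough illustration of the pooling idea mentioned in the summary, the sketch below collapses a trajectory of per-layer hidden state vectors into a single vector using softmax attention weights. It is a minimal sketch under stated assumptions, not the authors' architecture: the function name `attention_pool`, the single `query` vector (which would be learned in practice), and the scaled dot-product scoring are all hypothetical choices made for this example.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    layer_states: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool one vector per layer into a single vector.

    layer_states: one hidden state vector per LLM layer (same dimension each).
    query: scoring vector; assumed to be learned in a real model (hypothetical here).
    Returns (pooled_vector, attention_weights).
    """
    dim = len(query)
    scale = math.sqrt(dim)
    # Score each layer by its scaled dot product with the query vector.
    scores = [
        sum(h_j * q_j for h_j, q_j in zip(state, query)) / scale
        for state in layer_states
    ]
    weights = softmax(scores)
    # The pooled vector is the attention-weighted sum of the layer vectors.
    pooled = [
        sum(w * state[j] for w, state in zip(weights, layer_states))
        for j in range(dim)
    ]
    return pooled, weights


# Usage: three "layers" with 2-dimensional states and an illustrative query.
pooled_vec, layer_weights = attention_pool(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], query=[0.5, -0.5]
)
```

With an all-zero query every layer receives the same weight, so the pooled vector reduces to the plain mean over layers; a non-zero query lets some layers contribute more than others.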
📝 Abstract
Hallucinations in Large Language Models (LLMs) represent a critical barrier to their reliable deployment, a vulnerability heavily exacerbated in non-English and resource-constrained contexts. Existing detection approaches that rely on output confidence heuristics or single-layer internal representations frequently fail to capture deep, complex factual inconsistencies across diverse languages. To address this, we introduce MultiHaluDet, a novel three-stage stacking framework that detects multilingual hallucinations by probing the full hidden state trajectories of frozen LLMs without requiring language-specific fine-tuning. Our method extracts sequential features across multiple layers and processes them via a hybrid architecture using multi-scale attention and self-attention pooling. By generating out-of-fold embeddings that feed into a calibrated ensemble of classical classifiers, MultiHaluDet captures both fine-grained and coarse-grained patterns of factual inconsistency. Extensive experiments demonstrate that our framework achieves state-of-the-art detection performance, reaching up to 98.55% AUROC on the English HaluEval and TriviaQA benchmarks using Mistral-7B and LLaMA2-7B architectures. Crucially, we rigorously evaluate our framework's cross-lingual generalization across high-resource (French), medium-resource (Bangla), and low-resource (Amharic) languages. MultiHaluDet demonstrates exceptional representational robustness, consistently outperforming baselines and successfully transferring hallucination detection capabilities across typologically diverse linguistic tiers.
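The abstract's stacking step relies on out-of-fold outputs: each example is scored by a model that never saw it during training, so the next stage is not fitted on leaked labels. The following is a minimal sketch of that idea together with a pairwise AUROC computation, assuming a toy nearest-centroid scorer as the base learner. The helper names (`kfold_indices`, `fit_centroid_scorer`, `out_of_fold_scores`, `auroc`) and the toy data are hypothetical and stand in for the paper's actual neural encoder and classifier ensemble, which the abstract does not specify in detail.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def kfold_indices(n: int, k: int) -> List[List[int]]:
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_centroid_scorer(X: Sequence[Vector], y: Sequence[int]) -> Callable[[Vector], float]:
    """Toy base learner (an assumption, not the paper's model).

    Higher score means closer to the class-1 (hallucinated) centroid.
    Requires both classes to be present in the training data.
    """
    dim = len(X[0])

    def centroid(label: int) -> List[float]:
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    c0, c1 = centroid(0), centroid(1)

    def score(x: Vector) -> float:
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return d0 - d1

    return score


def out_of_fold_scores(X: Sequence[Vector], y: Sequence[int], k: int) -> List[float]:
    """Score every example with a model trained on the other k-1 folds."""
    oof = [0.0] * len(X)
    for fold in kfold_indices(len(X), k):
        held_out = set(fold)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [t for i, t in enumerate(y) if i not in held_out]
        scorer = fit_centroid_scorer(train_X, train_y)
        for i in fold:
            oof[i] = scorer(X[i])
    return oof


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Usage on made-up 1-D features; labels alternate so every training split has both classes.
features = [[0.0], [1.0], [0.1], [0.9], [0.2], [1.1]]
labels = [0, 1, 0, 1, 0, 1]
oof = out_of_fold_scores(features, labels, k=3)
toy_auroc = auroc(oof, labels)
```

In a full pipeline the out-of-fold outputs would be embeddings rather than scalar scores, and they would be passed to a separately calibrated classifier; the scalar version here only shows how the fold bookkeeping keeps training and scoring data disjoint.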
Problem

Research questions and friction points this paper is trying to address.

hallucination detection
multilingual
large language models
cross-lingual generalization
factual inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual hallucination detection
hidden state probing
frozen LLMs
cross-lingual generalization
multi-scale attention