HalluHard: A Hard Multi-Turn Hallucination Benchmark

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the susceptibility of large language models to hallucination in multi-turn dialogues, where context accumulation and error propagation often lead to factually unreliable outputs, particularly in high-stakes domains such as law, scientific research, healthcare, and programming. To tackle this challenge, the authors propose the first multi-turn hallucination evaluation framework that integrates inline citations with fully automated web-based evidence retrieval. They construct a benchmark of 950 seed questions spanning the four domains above; the accompanying judging pipeline can fetch and parse full-text source materials (e.g., PDFs), enabling fine-grained hallucination detection. Experimental results reveal that even state-of-the-art models like Opus-4.5 exhibit a hallucination rate of approximately 30% despite access to web search, and show that hallucination behavior is shaped by model capability, turn position in the dialogue, reasoning quality, and the type of knowledge required.

📝 Abstract
Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce $\textbf{HalluHard}$, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search. It can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search ($\approx 30\%$ for the strongest configuration, Opus-4.5 with web search), with content-grounding errors persisting at high rates. Finally, we show that hallucination behavior is shaped by model capacity, turn position, effective reasoning, and the type of knowledge required.
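The judging pipeline described in the abstract (retrieve the cited source, then check whether it actually supports the generated claim) can be illustrated with a minimal sketch. Everything below is hypothetical: the `SOURCE_TEXTS` stub stands in for the paper's web fetching and PDF parsing, and the token-overlap `supports` check stands in for the actual support judgment, which the paper performs with a full evaluation pipeline rather than string matching.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """A factual assertion paired with its inline citation."""
    text: str
    citation_url: str

# Hypothetical stand-in for web retrieval: a real pipeline would
# fetch, filter, and parse the cited page or PDF.
SOURCE_TEXTS = {
    "https://example.org/guideline": "Aspirin is recommended for secondary prevention.",
}

def fetch_source(url: str) -> str:
    # Returns "" when the cited source cannot be retrieved.
    return SOURCE_TEXTS.get(url, "")

def supports(claim: Claim, source_text: str) -> bool:
    # Toy support check via token overlap; the paper's judge is far
    # richer, but the grounded/ungrounded decision has the same shape.
    claim_tokens = set(claim.text.lower().split())
    source_tokens = set(source_text.lower().split())
    overlap = len(claim_tokens & source_tokens) / max(len(claim_tokens), 1)
    return overlap >= 0.5

def judge(claim: Claim) -> str:
    source = fetch_source(claim.citation_url)
    if not source:
        return "unverifiable"  # citation could not be retrieved
    return "grounded" if supports(claim, source) else "hallucinated"
```

In this sketch a claim whose citation resolves and matches is labeled grounded, a resolving-but-unsupported claim is labeled hallucinated (a content-grounding error), and a dead citation is unverifiable.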
Problem

Research questions and friction points this paper is trying to address.

hallucination
multi-turn dialogue
large language models
factual grounding
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination benchmark
multi-turn dialogue
inline citations
evidence retrieval
groundedness evaluation