🤖 AI Summary
This work addresses the susceptibility of large language models to hallucination in multi-turn dialogues, where context accumulation and error propagation often lead to factually unreliable outputs—particularly in high-stakes domains such as law, scientific research, healthcare, and programming. To tackle this challenge, the authors propose the first multi-turn hallucination evaluation framework that integrates inline citations with fully automated web-based evidence retrieval. They construct a benchmark comprising 950 seed questions spanning these four domains; the accompanying judging pipeline parses full-text source materials (e.g., PDFs), enabling fine-grained hallucination detection. Experimental results reveal that even state-of-the-art models like Opus-4.5 hallucinate at a rate of approximately 30% despite access to web search, and show that hallucination behavior depends significantly on model capability, dialogue turn depth, reasoning quality, and the type of knowledge required.
📝 Abstract
Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce $\textbf{HalluHard}$, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search. It can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search ($\approx 30\%$ for the strongest configuration, Opus-4.5 with web search), with content-grounding errors persisting at high rates. Finally, we show that hallucination behavior is shaped by model capacity, turn position, effective reasoning, and the type of knowledge required.