🤖 AI Summary
Large language model (LLM) agents suffer from pervasive hallucinations, stemming from distorted perception of instructions, interaction history, or environmental states, yet existing evaluations remain fragmented and lack systematic benchmarks. Method: We introduce MIRAGE-Bench, the first hallucination benchmark tailored for interactive LLM agents. It (1) establishes a three-dimensional hallucination taxonomy covering instruction understanding, history modeling, and environment perception; (2) employs a snapshot-based strategy that freezes decision points, ensuring reproducible test cases; and (3) proposes a risk-aware, fine-grained LLM-as-a-Judge evaluation paradigm, enhanced by domain-specific prompt engineering to improve judgment reliability. Results: Experiments uncover diverse, recurrent hallucination patterns, deliver actionable root-cause analysis, and suggest concrete mitigation strategies, laying the foundation for rigorous, trustworthy evaluation of LLM agents in real-world interactive settings.
📝 Abstract
Hallucinations pose critical risks for large language model (LLM)-based agents, often manifesting as hallucinative actions resulting from fabricated or misinterpreted information within the cognitive context. While recent studies have exposed such failures, existing evaluations remain fragmented and lack a principled testbed. In this paper, we present MIRAGE-Bench (Measuring Illusions in Risky AGEnt settings), the first unified benchmark for eliciting and evaluating hallucinations in interactive LLM-agent scenarios. We begin by introducing a three-part taxonomy of agentic hallucinations: actions that are unfaithful to (i) task instructions, (ii) execution history, or (iii) environment observations. To elicit such failures, we first perform a systematic audit of existing agent benchmarks, then synthesize test cases using a snapshot strategy that isolates decision points in a deterministic and reproducible manner. To evaluate hallucination behaviors, we adopt a fine-grained LLM-as-a-Judge paradigm with tailored risk-aware prompts, enabling scalable, high-fidelity assessment of agent actions without enumerating full action spaces. MIRAGE-Bench provides actionable insights into the failure modes of LLM agents and lays the groundwork for principled progress in mitigating hallucinations in interactive environments.