🤖 AI Summary
This work addresses illusory completion in search agents: an agent treats a task as complete and returns an underverified answer even though some constraints remain unresolved or violated. To diagnose the problem, the study introduces the Epistemic Ledger, an evaluation framework that tracks the evidential support and the agent's belief state for each constraint throughout multi-turn reasoning. The analysis identifies four recurring failure patterns: bare assertions, overlooked refutations, stagnation, and premature exit. Motivated by these patterns, the authors propose LiveLedger, an inference-time tracker that maintains explicit constraint state during execution. On multi-constraint problems, LiveLedger reduces underverified answers by up to 26.5% and improves overall accuracy by up to 11.6%.
📝 Abstract
Recent search agents leverage multi-turn reasoning and search tools to achieve strong performance on multi-hop and long-horizon benchmarks. Yet it remains unclear whether they reliably reason across all requirements by tracking, verifying, and maintaining multiple conditions in these questions. We study this capability under multi-constraint problems, where valid answers must satisfy several constraints simultaneously. We find that illusory completion frequently occurs, wherein agents believe tasks are complete despite unresolved or violated constraints, leading to underverified answers. To diagnose this behavior, we introduce the Epistemic Ledger, an evaluation framework that tracks evidential support and agents' beliefs for each constraint throughout multi-turn reasoning. Our analysis reveals four recurring failure patterns: bare assertions, overlooked refutations, stagnation, and premature exit. Motivated by these findings, we examine whether explicit constraint-state tracking during execution mitigates these failures via LiveLedger, an inference-time tracker. This simple intervention consistently improves performance, substantially reducing underverified answers (by up to 26.5%) and improving overall accuracy (by up to 11.6%) on multi-constraint problems.
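To make the idea of per-constraint state tracking concrete, the following is a minimal sketch of what such a ledger might look like. It is not the authors' implementation: the class names (`Ledger`, `ConstraintState`), the three-valued `Evidence` and `Belief` enums, and the `mismatches` check are all assumptions made for illustration, since the abstract does not specify the data model or how the four failure patterns are detected. The sketch only flags the generic case the abstract describes, where the agent believes a constraint is satisfied without supporting evidence.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Evidence(Enum):
    """What the retrieved evidence says about a constraint (hypothetical states)."""
    NONE = "none"
    SUPPORTS = "supports"
    REFUTES = "refutes"


class Belief(Enum):
    """What the agent currently claims about a constraint (hypothetical states)."""
    UNKNOWN = "unknown"
    SATISFIED = "satisfied"
    VIOLATED = "violated"


@dataclass
class ConstraintState:
    text: str
    evidence: Evidence = Evidence.NONE
    belief: Belief = Belief.UNKNOWN


@dataclass
class Ledger:
    """Illustrative per-constraint tracker; not the paper's actual design."""
    constraints: Dict[str, ConstraintState] = field(default_factory=dict)

    def add(self, key: str, text: str) -> None:
        self.constraints[key] = ConstraintState(text)

    def record_evidence(self, key: str, evidence: Evidence) -> None:
        self.constraints[key].evidence = evidence

    def record_belief(self, key: str, belief: Belief) -> None:
        self.constraints[key].belief = belief

    def unverified(self) -> List[str]:
        # Constraints that lack supporting evidence, whatever the agent believes.
        return [k for k, c in self.constraints.items()
                if c.evidence is not Evidence.SUPPORTS]

    def mismatches(self) -> List[str]:
        # Constraints the agent believes are satisfied without supporting evidence.
        return [k for k, c in self.constraints.items()
                if c.belief is Belief.SATISFIED
                and c.evidence is not Evidence.SUPPORTS]

    def ready_to_answer(self) -> bool:
        # Answer only when the ledger is non-empty and every constraint is supported.
        return bool(self.constraints) and not self.unverified()


ledger = Ledger()
ledger.add("c1", "born before 1950")
ledger.add("c2", "won a national award")
ledger.record_evidence("c1", Evidence.SUPPORTS)
ledger.record_belief("c1", Belief.SATISFIED)
ledger.record_belief("c2", Belief.SATISFIED)  # belief asserted with no evidence

print(ledger.mismatches())       # ['c2']
print(ledger.ready_to_answer())  # False
```

In this sketch, an agent loop would consult `ready_to_answer()` before returning a final answer and keep searching on any key listed by `unverified()`. Separating evidence from belief is the design choice that matters here: a single "done" flag per constraint could not expose the gap between what the agent claims and what it has actually verified.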