🤖 AI Summary
This study addresses the tendency of legal AI systems to give confident answers despite insufficient evidence, a limitation the authors term “presumptuousness”, which is particularly problematic in domains like unemployment insurance adjudication that require careful assessment of evidentiary sufficiency. Working from official training materials and guidance obtained through a collaboration with the Colorado Department of Labor and Employment, the authors build a novel benchmark that systematically varies information completeness, and they propose SPEC (Structured Prompting for Evidence Checklists), a structured prompting framework that requires models to explicitly identify missing information before rendering a determination, thereby enabling justified deferral when evidence is inadequate. An evaluation of four leading AI platforms shows that standard RAG-based approaches average only 15% accuracy when information is insufficient, and that advanced prompting methods improve on inconclusive cases but over-correct by withholding decisions even on clear cases. SPEC achieves 89% overall accuracy while appropriately deferring when evidence is insufficient, laying groundwork for trustworthy AI that augments rather than replaces human judgment.
📝 Abstract
A well-known limitation of AI systems is presumptuousness: the tendency to provide confident answers when information may be lacking. This challenge is particularly acute in legal applications, where a core task for attorneys, judges, and administrators is to determine whether evidence is sufficient to reach a conclusion. We study this problem in the important setting of unemployment insurance adjudication, which has seen rapid integration of AI systems and where the question of additional fact-finding poses the most significant bottleneck for a system that affects millions of applicants annually. First, through a collaboration with the Colorado Department of Labor and Employment, we secure rare access to official training materials and guidance to design a novel benchmark that systematically varies information completeness. Second, we evaluate four leading AI platforms and show that standard RAG-based approaches achieve an average of only 15% accuracy when information is insufficient. Third, advanced prompting methods improve accuracy on inconclusive cases but over-correct, withholding decisions even on clear cases. Fourth, we introduce a structured framework requiring explicit identification of missing information before any determination (SPEC, Structured Prompting for Evidence Checklists). SPEC achieves 89% overall accuracy while appropriately deferring when evidence is insufficient, demonstrating that presumptuousness in legal AI is systematic but addressable, and that addressing it is a necessary step towards systems that reliably support, rather than supplant, human judgment wherever decisions must await sufficient evidence.
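The abstract does not spell out how SPEC is implemented, so the following is only a minimal sketch of the general idea it describes: list what is missing first, and decide only when nothing is. The checklist items in `REQUIRED_FACTS`, the `Determination` type, and the `toy_rule` decision function are all hypothetical illustrations, not the authors' checklist, prompts, or any actual eligibility rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Mapping, Sequence

# Hypothetical checklist items; the paper's actual checklist is not given here.
REQUIRED_FACTS = ("separation_reason", "employer_statement", "claimant_statement")


@dataclass
class Determination:
    decision: str                              # "eligible", "ineligible", or "defer"
    missing: List[str] = field(default_factory=list)


def check_then_decide(
    facts: Mapping[str, str],
    decide: Callable[[Mapping[str, str]], str],
    required: Sequence[str] = REQUIRED_FACTS,
) -> Determination:
    """Identify missing facts before any determination is attempted."""
    # Step 1: explicitly list every required fact that is absent or empty.
    missing = [name for name in required if not facts.get(name)]
    # Step 2: if anything is missing, defer and report what still needs collecting.
    if missing:
        return Determination("defer", missing)
    # Step 3: only with a complete record is the decision rule applied.
    return Determination(decide(facts))


def toy_rule(facts: Mapping[str, str]) -> str:
    # Placeholder rule for illustration only; it is not a statement of law.
    return "eligible" if facts["separation_reason"] == "layoff" else "ineligible"


incomplete = {"separation_reason": "layoff", "employer_statement": ""}
complete = {
    "separation_reason": "layoff",
    "employer_statement": "position eliminated",
    "claimant_statement": "laid off with notice",
}
print(check_then_decide(incomplete, toy_rule))
print(check_then_decide(complete, toy_rule))
```

In a prompting setting, the same ordering would presumably be enforced in the model's instructions rather than in code; the gate above only illustrates why a forced missing-information step can prevent a confident answer on an incomplete record while still allowing a decision on a clear one.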