🤖 AI Summary
Existing legal judgment prediction (LJP) datasets oversimplify the task and lack the high-quality, interpretable annotations needed to reflect authentic judicial reasoning. To address this, we introduce AnnoCaseLaw, the first explainable LJP dataset focused on U.S. appellate negligence cases, comprising 471 expert-annotated cases. Legal domain experts collaboratively produced fine-grained, jurisprudentially grounded, multi-layer annotations covering three interrelated tasks: judgment prediction, legal concept identification, and automated case annotation. AnnoCaseLaw is the first dataset to systematically integrate these tasks, establishing a benchmark for explainable LJP. We develop multitask baselines using state-of-the-art large language models (LLMs) and show empirically that applying legal precedent is the core challenge in LJP. Our analysis further reveals fundamental limitations of current LLMs in generalizing judicial reasoning and citing precedent, highlighting critical gaps in their legal reasoning capabilities.
📝 Abstract
Legal systems worldwide continue to struggle with overwhelming caseloads, limited judicial resources, and growing complexity in legal proceedings. Artificial intelligence (AI) offers a promising solution, with Legal Judgment Prediction (LJP) -- the practice of predicting a court's decision from the case facts -- emerging as a key research area. However, existing datasets often formulate LJP unrealistically, failing to reflect its true difficulty. They also lack the high-quality annotations essential for legal reasoning and explainability. To address these shortcomings, we introduce AnnoCaseLaw, a first-of-its-kind dataset of 471 meticulously annotated U.S. Appeals Court negligence cases. Each case is enriched with comprehensive, expert-labeled annotations that highlight key components of judicial decision-making, along with relevant legal concepts. Our dataset lays the groundwork for more human-aligned, explainable LJP models. We define three legally relevant tasks: (1) judgment prediction; (2) concept identification; and (3) automated case annotation, and establish a performance baseline using industry-leading large language models (LLMs). Our results demonstrate that LJP remains a formidable task, with the application of legal precedent proving particularly difficult. Code and data are available at https://github.com/anonymouspolar1/annocaselaw.