🤖 AI Summary
This paper addresses the gap between existing LLM-based legal-understanding evaluations and real-world judicial practice. It introduces the first long-context benchmark grounded in 236 authentic U.S. Supreme Court case pairs, systematically evaluating models’ ability to identify *stare decisis* reversal (overruling) relationships. Methodologically, the authors employ an open-ended task design coupled with controlled-variable experiments, focusing on three dimensions: temporal reasoning, logical inference, and domain-specific legal comprehension. The analysis reveals three critical deficiencies across models: (1) insensitivity to historical context, (2) reliance on superficial lexical cues rather than substantive reasoning, and (3) violations of temporal logic, which together lead to substantial performance degradation on older cases. The core contribution is the first empirical demonstration of fundamental limitations in LLMs’ capacity for common-law reasoning, a cornerstone of adversarial legal systems, along with a shift in legal AI evaluation from short-text, static benchmarks toward long-context, temporally grounded, high-stakes real-world scenarios.
📝 Abstract
Large language models (LLMs) with extended context windows show promise for complex legal reasoning tasks, yet their ability to understand long legal documents remains insufficiently evaluated. Developing long-context benchmarks that capture realistic, high-stakes tasks remains a significant challenge, as most existing evaluations rely on simplified synthetic tasks that fail to represent the complexity of real-world document understanding. Overruling relationships are foundational to common-law doctrine and common in judicial opinions; they provide a focused and important testbed for long-document legal understanding that closely resembles what legal professionals actually do. We present an assessment of state-of-the-art LLMs on identifying overruling relationships in U.S. Supreme Court cases using a dataset of 236 case pairs. Our evaluation reveals three critical limitations: (1) era sensitivity -- models show degraded performance on historical cases compared to modern ones, revealing a fundamental temporal bias in their training; (2) shallow reasoning -- models rely on superficial logical heuristics rather than deep legal comprehension; and (3) context-dependent reasoning failures -- models produce temporally impossible overruling relationships in complex open-ended tasks despite maintaining basic temporal awareness in simple contexts. Our work contributes a benchmark that addresses this critical gap in realistic long-context evaluation, providing an environment that mirrors the complexity and stakes of actual legal reasoning tasks.
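The "temporally impossible relationships" failure mode has a simple formal core: a case can only overrule a precedent decided before it. The following minimal Python sketch (an illustration, not code from the paper; the `Case` structure and example pair are assumptions for demonstration) shows the kind of temporal-consistency check a model's predicted overruling direction must satisfy.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    """A hypothetical minimal record for one case in a pair."""
    name: str
    decided: date


def temporally_valid(overruling: Case, overruled: Case) -> bool:
    """A predicted overruling is only possible if the overruling
    case was decided after the case it claims to overrule."""
    return overruling.decided > overruled.decided


# Well-known real pair: Brown v. Board (1954) overruled Plessy (1896).
plessy = Case("Plessy v. Ferguson", date(1896, 5, 18))
brown = Case("Brown v. Board of Education", date(1954, 5, 17))

print(temporally_valid(brown, plessy))   # correct direction -> True
print(temporally_valid(plessy, brown))   # impossible direction -> False
```

A check like this flags exactly the class of errors the third limitation describes: predictions where an earlier case is claimed to overrule a later one.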