AR-BENCH: Benchmarking Legal Reasoning with Judgment Error Detection, Classification and Correction

📅 2026-01-30
🤖 AI Summary
Legal judgments are prone to errors due to case complexity and the abstract nature of legal norms, while existing appellate review mechanisms face efficiency pressure from surging case volumes. This work introduces the “Appellate Review” task, which focuses on detecting, classifying, and correcting errors after a judgment has been issued. To support this task, we construct AR-BENCH, a fine-grained benchmark dataset comprising 8,700 annotated judgments and 34,617 supplementary legal documents. Using a structured error taxonomy and a large language model (LLM) evaluation framework, our empirical analysis of 14 LLMs reveals significant limitations in identifying misapplications of law, moving legal AI from predictive generation toward reliability-oriented diagnosis.

📝 Abstract
Legal judgments may contain errors due to the complexity of case circumstances and the abstract nature of legal concepts, while existing appellate review mechanisms face efficiency pressures from a surge in case volumes. Although current legal AI research focuses on tasks like judgment prediction and legal document generation, the task of judgment review differs fundamentally in its objectives and paradigm: it centers on detecting, classifying, and correcting errors after a judgment is issued, constituting anomaly detection rather than prediction or generation. To address this research gap, we introduce a novel task, APPELLATE REVIEW, which aims to assess models' diagnostic reasoning and reliability in legal practice. We also construct a novel benchmark dataset, AR-BENCH, which comprises 8,700 finely annotated decisions and 34,617 supplementary corpora. By evaluating 14 large language models, we reveal critical limitations in existing models' ability to identify legal application errors, providing empirical evidence for future improvements.
Problem

Research questions and friction points this paper is trying to address.

legal reasoning
judgment error
appellate review
anomaly detection
legal AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Appellate Review
Legal Reasoning
Judgment Error Detection
AR-BENCH
Anomaly Detection
Yifei Li
SKLCCSE, Beihang University, China
Richong Zhang
Professor of Computer Science, Beihang University
Data Mining, Recommender System, Social Computing
Wanyu Tu
SCCE, University of Science and Technology Beijing, China
Zhijie Nie
Ph.D. Candidate in Computer Science, Beihang University
Natural Language Processing, Information Retrieval
Haokun Luo
SKLCCSE, Beihang University, China
Chuantao Yin
Sino-French Engineer School, Beihang University, China
Pengchong Li
People’s Procuratorate of Beijing Municipality, China