🤖 AI Summary
To address case backlogs in Swiss courts, this paper proposes a multilingual method for predicting the criticality of legal cases, supporting intelligent prioritization of pending cases. Methodologically, we design an algorithmic two-tier automatic annotation framework: (1) a binary "Leading Decision" label, and (2) a fine-grained ranking label that integrates citation frequency with time-decay weighting. We conduct supervised fine-tuning on multilingual foundation models, including XLM-R and mBERT, circumventing the bottleneck of manual annotation. Our contributions are twofold: first, we empirically demonstrate, within a multilingual legal domain, that fine-tuned smaller models significantly outperform larger zero-shot models; second, we validate the pivotal role of high-quality, domain-specific training data for specialized NLP tasks. On the Criticality Prediction dataset, fine-tuned models achieve substantially higher F1 scores and ranking metrics than zero-shot baselines.
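The ranking label described above combines citation frequency with time-decay weighting. A minimal sketch of how such a label could be derived is shown below; the exponential half-life decay, the function names, and the bucket thresholds are illustrative assumptions, not the paper's exact formula.

```python
import math
from datetime import date

def criticality_score(citation_dates, as_of, half_life_years=5.0):
    """Time-decay-weighted citation count: recent citations count more.
    Each citation contributes exp(-ln(2) * age_years / half_life), so a
    citation exactly one half-life old contributes 0.5.
    NOTE: illustrative decay scheme, not the paper's exact weighting."""
    lam = math.log(2) / half_life_years
    score = 0.0
    for d in citation_dates:
        age_years = (as_of - d).days / 365.25
        score += math.exp(-lam * max(age_years, 0.0))
    return score

def citation_label(score, thresholds=(0.5, 2.0, 5.0, 10.0)):
    """Bucket the continuous score into ordinal criticality classes
    (0 = least critical). Thresholds here are placeholders."""
    return sum(score >= t for t in thresholds)
```

Because both functions operate only on citation metadata (dates, counts), labels can be derived algorithmically for every case in the corpus, which is what allows the dataset to be far larger than a manually annotated one.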
📝 Abstract
Court systems around the world are overwhelmed, leading to large backlogs of pending cases. Effective triage systems, like those in emergency rooms, could ensure proper prioritization of open cases, optimizing time and resource allocation in the court system. In this work, we introduce the Criticality Prediction dataset, a novel resource for evaluating case prioritization. Our dataset features a two-tier labeling system: (1) the binary LD-Label, identifying cases published as Leading Decisions (LD), and (2) the more granular Citation-Label, ranking cases by their citation frequency and recency, allowing for a more nuanced evaluation. Unlike existing approaches that rely on resource-intensive manual annotations, we derive labels algorithmically, yielding a much larger dataset than would otherwise be possible. We evaluate several multilingual models, including both smaller fine-tuned models and large language models in a zero-shot setting. Our results show that the fine-tuned models consistently outperform their larger counterparts, thanks to our large training set. This highlights that for highly domain-specific tasks like ours, large training sets remain valuable.
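The comparison between fine-tuned models and zero-shot baselines relies on classification metrics such as F1. For an imbalanced task like LD-Label prediction, where Leading Decisions are rare, macro-averaged F1 is a natural choice because it weights each class equally. A self-contained reference implementation (standard macro-F1, not code from the paper):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute precision/recall/F1 per class, then
    take the unweighted mean, so a rare class (e.g. Leading Decisions)
    counts as much as the majority class."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but true class was t
            fn[t] += 1  # missed an instance of t
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Note how a classifier that always predicts the majority class scores well on accuracy but poorly on macro-F1, which is why the latter is the more informative metric here.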