🤖 AI Summary
This study addresses the challenge of identifying implicit references to the French Civil Code in decisions rendered by French courts of first instance, and of distinguishing genuine legal reasoning from factual descriptions or semantically similar but legally irrelevant content. To this end, the authors construct a benchmark of 1,015 judgment paragraph–statute pairs annotated by three legal experts, and systematically analyze the impact of inter-annotator disagreement on model performance. They find that labeling discrepancies at the boundary between factual narrative and legal reasoning are a primary source of model failure. To mitigate this, they propose an unsupervised top-k ranking method based on multi-model consensus, complemented by supervised ensemble learning and legal semantic analysis. Experimental results show that the supervised ensemble achieves an F1 score of 0.70 (accuracy: 77%), while the unsupervised approach attains 76% precision in the top-200 setting; the remaining false positives largely stem from inherent ambiguities in legal application rather than model error.
📝 Abstract
Computational methods applied to legal scholarship hold the promise of analyzing law at scale. We start from a simple question: how often do courts implicitly apply statutory rules? This requires distinguishing legal reasoning from semantic similarity. We focus on implicit citation of the French Civil Code in first-instance court decisions and introduce a benchmark of 1,015 passage-article pairs annotated by three legal experts. We show that expert disagreement predicts model failures. Inter-annotator agreement is moderate ($\kappa$ = 0.33), with 43% of disagreements involving the boundary between factual description and legal reasoning. Our supervised ensemble achieves F1 = 0.70 (77% accuracy), but this figure conceals an asymmetry: 68% of false positives fall on the 33% of cases where the annotators disagreed. Despite these limits, reframing the task as top-k ranking and leveraging multi-model consensus yields 76% precision at k = 200 in an unsupervised setting. Moreover, the remaining false positives tend to surface legally ambiguous applications rather than obvious errors.
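To make the top-k reframing concrete, here is a minimal sketch of unsupervised multi-model consensus ranking with precision@k evaluation. This is not the authors' code: the pair IDs, vote structure, and scoring are hypothetical, assuming only that each model casts a binary vote on whether a passage-article pair reflects an implicit application, and that pairs are ranked by vote count.

```python
def consensus_top_k(model_votes, k):
    """Rank candidate pairs by how many models flag them; return the top-k IDs.

    model_votes: dict mapping pair_id -> list of binary votes (one per model).
    """
    ranked = sorted(model_votes, key=lambda pid: sum(model_votes[pid]), reverse=True)
    return ranked[:k]


def precision_at_k(top_k_ids, gold_positive):
    """Fraction of the top-k pairs that experts labeled as genuine applications."""
    return sum(pid in gold_positive for pid in top_k_ids) / len(top_k_ids)


# Toy example: five hypothetical passage-article pairs scored by three models.
votes = {
    "p1": [1, 1, 1],  # unanimously flagged
    "p2": [1, 1, 0],
    "p3": [0, 1, 0],
    "p4": [0, 0, 0],
    "p5": [1, 0, 0],
}
top2 = consensus_top_k(votes, k=2)
print(top2)                                # ['p1', 'p2']
print(precision_at_k(top2, {"p1", "p3"}))  # 0.5
```

The design choice mirrors the paper's framing: rather than forcing a binary decision on every pair, ranking by consensus concentrates expert review on the pairs most models agree on, which is where the reported 76% precision at k = 200 is measured.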