LEXam: Benchmarking Legal Reasoning on 340 Law Exams

📅 2025-05-19
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle with structured, multi-step reasoning on long-form, open-ended legal examination questions. Method: We introduce LEXam, the first large-scale, multilingual, cross-curricular legal exam benchmark, comprising 4,886 questions in English and German, with open-ended questions accompanied by fine-grained, expert-verified reasoning-path annotations. We propose an "LLM-as-a-Judge" evaluation paradigm that integrates question-type classification, stepwise reasoning guidance, and legal expert validation into an automated scoring protocol, moving beyond answer accuracy alone. Contribution/Results: Experiments reveal systematic deficiencies in current state-of-the-art LLMs across core legal reasoning tasks, including element analysis and rule application, enabling fine-grained capability differentiation and reproducible, quantitative assessment of legal reasoning performance.

📝 Abstract
Long-form legal reasoning remains a key challenge for large language models (LLMs) despite recent advances in test-time scaling. We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are accompanied by explicit guidance outlining the expected legal reasoning approach, such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, models notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models of varying capability. Adopting an LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately. Our evaluation setup provides a scalable method for assessing legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/
Problem

Research questions and friction points this paper is trying to address.

Benchmarking long-form legal reasoning in LLMs using 340 law exams
Evaluating LLMs on structured, multi-step legal reasoning challenges
Assessing legal reasoning quality beyond simple accuracy metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

LEXam benchmark with 340 law exams
LLM-as-a-Judge for legal reasoning evaluation (see the sketch after this list)
Explicit guidance on multi-step legal reasoning (issue spotting, rule recall, rule application) for open questions
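To make the LLM-as-a-Judge idea concrete, here is a minimal sketch of one grading step. Everything below is illustrative: the rubric dimensions, the prompt template, and the `judge` callable are assumptions for exposition, not LEXam's actual protocol (which additionally involves question-type classification and legal expert validation of the judge's scores).

```python
# Hypothetical sketch of one LLM-as-a-Judge grading step, assuming a simple
# JSON rubric. Not LEXam's actual prompt or scoring protocol.
from dataclasses import dataclass
from typing import Callable
import json

@dataclass
class OpenQuestion:
    question: str
    reference_answer: str
    reasoning_guidance: str  # e.g. "issue spotting -> rule recall -> rule application"

# Illustrative judge prompt; double braces escape the JSON template for str.format.
JUDGE_PROMPT = """You are a law professor grading an exam answer.
Question: {question}
Expected reasoning approach: {guidance}
Reference answer: {reference}
Candidate answer: {candidate}
Score each dimension from 0 to 10 and reply with JSON only:
{{"issue_spotting": int, "rule_recall": int, "rule_application": int, "overall": int}}"""

def judge_answer(item: OpenQuestion, candidate: str,
                 judge: Callable[[str], str]) -> dict:
    """Grade one candidate answer against the reference answer and the
    expert-annotated reasoning guidance, then parse the judge's JSON verdict."""
    prompt = JUDGE_PROMPT.format(
        question=item.question,
        guidance=item.reasoning_guidance,
        reference=item.reference_answer,
        candidate=candidate,
    )
    return json.loads(judge(prompt))

if __name__ == "__main__":
    q = OpenQuestion(
        question="Is the seller liable for the defect under contract law?",
        reference_answer="Yes, because the defect existed at the transfer of risk ...",
        reasoning_guidance="issue spotting -> rule recall -> rule application",
    )
    # Stub judge for demonstration; a real run would call an LLM API here.
    stub = lambda _: ('{"issue_spotting": 7, "rule_recall": 6, '
                      '"rule_application": 5, "overall": 6}')
    print(judge_answer(q, "The seller is liable because ...", stub))
```

Keeping the judge as an injected callable makes the scoring step testable with a stub, as above, before wiring in a real model API or layering on the expert-validation pass the paper describes.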
Yu Fan
ETH Zurich
Natural Language Processing · Legal NLP · Computational Social Science
Jingwei Ni
Doctoral Researcher in NLP, ETH Zurich
NLP for social good · claim verification · causal NLP · computational social science
Jakob Merane
University of Zurich, Max Planck Institute for Research on Collective Goods
Etienne Salimbeni
EPFL
Deep Learning · Trustworthy AI
Yang Tian
University of Zurich
Yoan Hermstrüwer
University of Zurich, Max Planck Institute for Research on Collective Goods
Yinya Huang
Postdoctoral Fellow at ETH AI Center, ETH Zürich; previously CityU Hong Kong, SYSU
AI for Math · AI for Science · Reliable Machine Learning · LLMs · NLP
Mubashara Akhtar
ETH AI Center Fellow at ETH Zurich
NLP · Multimodality · Benchmarking & Evaluation
Florian Geering
University of Zurich
Oliver Dreyer
University of St. Gallen
Daniel Brunner
CNRS researcher, FEMTO-ST, Optics Department, Besançon
Photonic neural networks · unconventional computation · semiconductor nonlinear optics · complex photonics · nonlinear dynamics
Markus Leippold
University of Zurich, Department of Finance
Finance · Climate Change · Natural Language Processing · Financial Economics · Mathematical Finance
Mrinmaya Sachan
Assistant Professor, ETH Zürich
Natural Language Processing · Reasoning · AI for Education
Alexander Stremitzer
ETH Zurich
Christoph Engel
Max Planck Institute for Research on Collective Goods
Elliott Ash
Associate Professor of Law, Economics, and Data Science
Law and Economics · Political Economy · Text as Data · Large Language Models
Joel Niklaus
Hugging Face, Stanford University
Natural Language Processing · Legal NLP · Legal AI