BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

📅 2026-05-27
🤖 AI Summary
This study addresses the absence of a systematic benchmark for evaluating subsumption reasoning under German law. The authors introduce BenGER, a dataset comprising 596 exam-style case questions and 531 doctrinal reasoning problems, establishing the first multi-tiered evaluation framework tailored to German legal subsumption. They further propose an LLM-as-a-Judge automated scoring methodology and validate it against blinded human grading: substituting the LLM judge for a blind human reviewer preserves agreement with the human pool (r = 0.96). Results indicate that closed-source frontier models lead overall, while human–LLM collaborative responses substantially outperform purely human-written answers, underscoring the practical potential of large language models in German legal reasoning tasks.
📝 Abstract
We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. We evaluate 12 contemporary LLM systems (closed flagship, efficiency-oriented, and open-weight) across automatic and judge-based metrics. On a controlled validation subset of timed human-written solutions under both unaided and human-AI co-creation conditions, we contextualise model performance against these human baselines. We introduce a rubric-aligned LLM-as-a-Judge framework cross-validated against a multi-rater human-grading protocol (three blind reviews plus one author-informed creator review per solution). Our results show that replacing a blind human reviewer with the LLM judge degrades agreement with the full human pool no more than removing that reviewer altogether (Calderon r=0.96 vs. r=0.96, matched n=30), that closed-flagship systems lead the leaderboard across all corpora, and that human-AI co-creation substantially outperforms unaided human work.
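The replace-versus-remove comparison described in the abstract can be illustrated with a minimal sketch. It assumes agreement is measured as a Pearson correlation between a pool's per-solution mean score and the mean of the full human pool; the abstract does not spell out the exact procedure, so this reading, the function names (`pearson`, `pool_mean`, `replace_vs_remove`), and the toy scores below are all hypothetical and are not taken from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def pool_mean(raters):
    """Per-solution mean score across a list of raters."""
    return [sum(col) / len(col) for col in zip(*raters)]

def replace_vs_remove(humans, judge, target):
    """Compare two ways of handling one human reviewer (assumed reading).

    humans: dict mapping rater name -> list of scores, one per solution
    judge:  list of LLM-judge scores for the same solutions
    target: name of the blind reviewer to replace or remove
    Returns (r_replace, r_remove), each a correlation with the full human pool.
    """
    full = pool_mean(list(humans.values()))
    others = [s for name, s in humans.items() if name != target]
    r_remove = pearson(pool_mean(others), full)
    r_replace = pearson(pool_mean(others + [judge]), full)
    return r_replace, r_remove

# Toy, made-up scores for five solutions (illustration only, not BenGER data).
humans = {
    "blind_1": [10, 12, 8, 15, 11],
    "blind_2": [11, 12, 7, 14, 10],
    "blind_3": [9, 13, 8, 16, 12],
    "creator": [10, 11, 9, 15, 11],
}
judge = [10, 12, 8, 15, 12]

r_replace, r_remove = replace_vs_remove(humans, judge, "blind_1")
print(f"replace: {r_replace:.3f}, remove: {r_remove:.3f}")
```

Under this reading, the judge counts as an adequate stand-in for a reviewer when `r_replace` is not lower than `r_remove`.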
Problem

Research questions and friction points this paper is trying to address.

legal reasoning
subsumption
German law
LLM benchmarking
BenGER
Innovation

Methods, ideas, or system contributions that make the work stand out.

legal reasoning
LLM-as-a-Judge
subsumption
benchmarking
human-AI co-creation
Sebastian Nagl
Technical University of Munich (TUM)
Ann-Kristin Mayrhofer
Ludwig Maximilian University of Munich (LMU)
Martin Heidebach
Ludwig Maximilian University of Munich (LMU)
Aleyna Koçak
University of Konstanz
Anne Zettelmeier
University of Saarbrücken
Elly Breu
Technical University of Munich (TUM)
Angelina Greiner
Technical University of Munich (TUM)
Sofija Milijas
Technical University of Munich (TUM)
Matthias Grabmair
Technical University of Munich
Data Science, Artificial Intelligence & Law, Knowledge Representation & Reasoning