BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study introduces BenGER, a benchmark for evaluating LLM systems on subsumption-based legal reasoning in German law. The dataset includes 596 exam-style free-text case tasks spanning multiple levels of legal education and 531 short doctrinal reasoning tasks. The authors also propose a rubric-aligned LLM-as-a-Judge scoring framework, cross-validated against a multi-rater human-grading protocol: replacing a blind human reviewer with the LLM judge degrades agreement with the full human pool no more than removing that reviewer altogether (r = 0.96 in both cases, matched n = 30). Across the 12 evaluated systems, closed-flagship models lead the leaderboard on all corpora, and on a timed validation subset human-AI co-creation substantially outperforms unaided human work.

📝 Abstract

We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. We evaluate 12 contemporary LLM systems (closed flagship, efficiency-oriented, and open-weight) across automatic and judge-based metrics. On a controlled validation subset of timed human-written solutions under both unaided and human-AI co-creation conditions, we contextualise model performance against these human baselines. We introduce a rubric-aligned LLM-as-a-Judge framework cross-validated against a multi-rater human-grading protocol (three blind reviews plus one author-informed creator review per solution). Our results show that replacing a blind human reviewer with the LLM judge degrades agreement with the full human pool no more than removing that reviewer altogether (Calderon r=0.96 vs. r=0.96, matched n=30), that closed-flagship systems lead the leaderboard across all corpora, and that human-AI co-creation substantially outperforms unaided human work.

Problem

Research questions and friction points this paper is trying to address.

legal reasoning

subsumption

German law

LLM benchmarking

BenGER

Innovation

Methods, ideas, or system contributions that make the work stand out.

legal reasoning

LLM-as-a-Judge

subsumption