🤖 AI Summary
This study addresses the absence of a systematic benchmark for evaluating subsumption reasoning under German law. The authors introduce BenGER, a novel dataset comprising 596 exam-style case questions and 531 doctrinal reasoning problems, establishing the first multi-tiered evaluation framework tailored to German legal subsumption. They further propose an LLM-as-a-Judge automated scoring methodology, which demonstrates high alignment with human judgments (r = 0.96) through extensive model comparisons, blinded human evaluations, and human–AI collaboration experiments. Results indicate that closed-source frontier models achieve overall superior performance, while human–LLM collaborative responses significantly outperform purely human-generated answers, underscoring the practical potential of large language models in German legal reasoning tasks.
📝 Abstract
We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. We evaluate 12 contemporary LLM systems (closed flagship, efficiency-oriented, and open-weight) across automatic and judge-based metrics. We contextualise model performance against human baselines drawn from a controlled validation subset of timed human-written solutions, produced under both unaided and human–AI co-creation conditions. We introduce a rubric-aligned LLM-as-a-Judge framework cross-validated against a multi-rater human-grading protocol (three blind reviews plus one author-informed creator review per solution). Our results show that replacing a blind human reviewer with the LLM judge degrades agreement with the full human pool no more than removing that reviewer altogether (Calderon r = 0.96 vs. r = 0.96, matched n = 30), that closed-flagship systems lead the leaderboard across all corpora, and that human–AI co-creation substantially outperforms unaided human work.
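The replace-versus-remove comparison in the abstract can be illustrated with a small sketch. This is not the authors' procedure, and it does not reproduce the Calderon statistic. It assumes plain Pearson correlation against the mean score of the full human pool, and the function names (`pearson`, `pool_mean`, `replace_vs_remove`) and all scores are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def pool_mean(raters):
    """Per-solution mean score across a list of raters."""
    return [sum(col) / len(col) for col in zip(*raters)]


def replace_vs_remove(humans, judge, idx):
    """Compare two degraded pools against the full human pool.

    Returns (r_replace, r_remove):
      r_replace: reviewer `idx` is swapped for the LLM judge
      r_remove:  reviewer `idx` is dropped with no substitute
    """
    full = pool_mean(humans)
    others = [r for i, r in enumerate(humans) if i != idx]
    r_replace = pearson(pool_mean(others + [judge]), full)
    r_remove = pearson(pool_mean(others), full)
    return r_replace, r_remove


# Made-up scores: 3 blind human reviewers x 5 solutions, plus one LLM judge.
humans = [
    [10, 7, 4, 12, 9],
    [11, 6, 5, 13, 8],
    [9, 8, 3, 12, 10],
]
judge = [10, 7, 5, 12, 9]

for i in range(len(humans)):
    r_replace, r_remove = replace_vs_remove(humans, judge, i)
    print(f"reviewer {i}: replace r={r_replace:.3f}, remove r={r_remove:.3f}")
```

Under this simplified reading, the judge is a workable substitute for a reviewer when `r_replace` is not lower than `r_remove`, which is the pattern the abstract reports.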