AI Summary
This work addresses critical limitations in the current evaluation of legal large language models (LLMs): fragmented workflows, low transparency, poor reproducibility, and limited involvement of non-technical legal experts. To overcome these challenges, the authors propose and implement an open-source web platform that, for the first time in the German legal domain, enables end-to-end benchmarking through an integrated workflow covering task creation, collaborative annotation, configurable LLM execution, and multi-dimensional evaluation with lexical, semantic, factual, and judge-based metrics. The platform supports multi-institutional collaboration with tenant isolation and role-based access control, and can provide annotators with reference-grounded formative feedback. The system aims to improve transparency, reproducibility, and the engagement of legal domain experts in AI evaluation.
Abstract
Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.