BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

📅 2026-04-15
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses limitations in the current evaluation of legal large language models (LLMs): fragmented workflows, low transparency, poor reproducibility, and limited involvement of non-technical legal experts. The authors present BenGER, an open-source web platform for German legal tasks that supports end-to-end benchmarking in one integrated workflow: task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. The platform supports multi-organization projects with tenant isolation and role-based access control, and can optionally give annotators formative, reference-grounded feedback. The stated goal is to improve transparency, reproducibility, and the participation of legal domain experts in LLM evaluation.

๐Ÿ“ Abstract
Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.
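To make "tenant isolation and role-based access control" concrete, here is a minimal sketch of how such a check could be layered: first confirm the user's organization participates in the project, then confirm the user's role permits the action. This is not BenGER's implementation. The role names, the permission table, and the names `User`, `Project`, and `is_allowed` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    # Hypothetical roles; the actual platform may define different ones.
    ANNOTATOR = "annotator"
    MAINTAINER = "maintainer"


# Hypothetical permission table mapping each role to its allowed actions.
PERMISSIONS = {
    Role.ANNOTATOR: {"annotate"},
    Role.MAINTAINER: {"annotate", "create_task", "run_model", "evaluate"},
}


@dataclass(frozen=True)
class User:
    name: str
    org_id: str
    role: Role


@dataclass(frozen=True)
class Project:
    title: str
    org_ids: frozenset  # organizations participating in this project


def is_allowed(user: User, project: Project, action: str) -> bool:
    """Return True only if both the tenant check and the role check pass."""
    # Tenant isolation: users outside the project's organizations see nothing.
    if user.org_id not in project.org_ids:
        return False
    # Role-based access control: the action must be granted to the user's role.
    return action in PERMISSIONS[user.role]
```

Under these assumptions, an annotator from a participating organization can annotate but cannot start a model run, and a maintainer from a non-participating organization is rejected before roles are even consulted.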
Problem

Research questions and friction points this paper is trying to address.

legal reasoning
large language models
benchmarking
reproducibility
expert annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

collaborative web platform
end-to-end benchmarking
legal reasoning evaluation
multi-tenant isolation
reference-grounded feedback