BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses limitations in how large language models (LLMs) are currently evaluated for legal reasoning: workflows are fragmented across platforms and scripts, transparency and reproducibility are limited, and non-technical legal experts rarely take part. The authors present BenGER, an open-source web platform for end-to-end benchmarking of German legal tasks that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. The platform supports multi-organization projects with tenant isolation and role-based access control, and can optionally give annotators formative, reference-grounded feedback. By keeping these steps in one system, it aims to make legal LLM evaluation more transparent and reproducible and to involve legal domain experts more directly.

📝 Abstract

Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.
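
To make the "lexical" metric family mentioned in the abstract concrete, here is a minimal sketch of one common lexical score, token-level F1 between a model answer and a reference answer. The abstract does not say which lexical metrics BenGER implements, so the function name `token_f1` and the whitespace tokenization are illustrative assumptions, not the platform's actual code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference (illustrative only).

    Assumption: lowercase + whitespace tokenization. A real legal benchmark
    would likely use a German-aware tokenizer.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("Der Vertrag ist wirksam", "Der Vertrag ist nichtig"))
```

The example pair shows why a lexical score alone is insufficient for legal answers: three of four tokens match, yet the two sentences state opposite legal conclusions ("wirksam", valid, versus "nichtig", void). This is the kind of gap that the semantic, factual, and judge-based metrics named in the abstract are meant to cover.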

Problem

Research questions and friction points this paper is trying to address.

legal reasoning

large language models

benchmarking

reproducibility

expert annotation

Innovation

Methods, ideas, or system contributions that make the work stand out.

collaborative web platform

end-to-end benchmarking

legal reasoning evaluation

multi-tenant isolation

reference-grounded feedback
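
The combination of tenant isolation and role-based access control listed above can be sketched as a two-step check: first confirm that the user's organization participates in the project, then confirm that the user's role grants the requested action. All names here (`Role`, `User`, `Project`, `PERMISSIONS`, `is_allowed`) and the specific roles and actions are hypothetical; the source does not describe BenGER's actual data model or permission scheme.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    # Hypothetical roles, chosen only for illustration.
    ANNOTATOR = "annotator"
    MANAGER = "manager"


# Hypothetical role-to-action mapping (not taken from BenGER).
PERMISSIONS = {
    Role.ANNOTATOR: {"annotate"},
    Role.MANAGER: {"annotate", "create_task", "run_model"},
}


@dataclass(frozen=True)
class User:
    name: str
    org_id: str
    role: Role


@dataclass(frozen=True)
class Project:
    project_id: str
    # A multi-organization project lists every participating organization.
    member_org_ids: frozenset


def is_allowed(user: User, project: Project, action: str) -> bool:
    """Return True only if both the tenant check and the role check pass."""
    # Tenant isolation: users outside the participating organizations
    # are rejected regardless of their role.
    if user.org_id not in project.member_org_ids:
        return False
    # Role-based access control: the role must grant the action.
    return action in PERMISSIONS[user.role]


if __name__ == "__main__":
    project = Project("contract-law-qa", frozenset({"uni-a", "uni-b"}))
    annotator = User("alice", "uni-a", Role.ANNOTATOR)
    outsider = User("bob", "uni-c", Role.MANAGER)
    print(is_allowed(annotator, project, "annotate"))
    print(is_allowed(annotator, project, "run_model"))
    print(is_allowed(outsider, project, "annotate"))
```

Checking organization membership before the role means that a high-privilege role in one organization never grants access to another organization's project.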
