Agent Benchmarks Fail Public Sector Requirements

📅 2026-01-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing agent benchmarks fall short of the stringent legal, procedural, and structural requirements of the public sector. This study addresses that gap by proposing, for the first time, a framework for public-sector-oriented agent evaluation derived from first principles of public administration. The framework specifies four essential criteria for benchmarks: procedural fidelity (process-based evaluation), real-world relevance, public-sector specificity, and metrics that reflect public-sector requirements. To operationalize the framework, the authors develop a scalable, automated analysis pipeline assisted by large language models and validated through expert consultation. Applying this pipeline in a systematic review of more than 1,300 benchmarking studies reveals that no existing benchmark fully satisfies all four criteria. The work thus provides a theoretical foundation, a methodological toolkit, and clear pathways for advancing the evaluation of AI agents in public-sector contexts.

📝 Abstract
Deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of public administration literature: benchmarks must be process-based, realistic, public-sector-specific, and report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers for these criteria using an expert-validated LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings provide a call to action for both researchers to develop public-sector-relevant benchmarks and for public-sector officials to apply these criteria when evaluating their own agentic use cases.
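The abstract describes an expert-validated, LLM-assisted pipeline used to screen more than 1,300 benchmark papers against the four criteria. The paper's own code is not reproduced here; the snippet below is a minimal Python sketch of how one such screening step might look, assuming a generic `call_llm` function (a hypothetical placeholder for whatever chat-completion client is used) and JSON-formatted model responses. The criterion phrasings are illustrative paraphrases, not the authors' exact annotation instructions.

```python
import json
from dataclasses import dataclass

# Illustrative paraphrases of the paper's four criteria (not the authors' exact wording).
CRITERIA = {
    "process_based": "Does the benchmark evaluate the process the agent follows, not only final outcomes?",
    "realistic": "Are the tasks and environments drawn from realistic, real-world settings?",
    "public_sector_specific": "Is the benchmark targeted at public-sector institutions or tasks?",
    "public_sector_metrics": "Do the reported metrics reflect public-sector requirements (e.g. legality, procedure)?",
}

PROMPT_TEMPLATE = (
    "You are screening AI-agent benchmark papers.\n"
    "Abstract:\n{abstract}\n\n"
    "For each criterion below, answer whether it is met and give a one-sentence justification.\n"
    "Criteria:\n{criteria}\n\n"
    'Respond as JSON mapping each criterion name to {{"met": bool, "why": str}}.'
)


@dataclass
class ScreeningResult:
    paper_id: str
    verdicts: dict  # criterion name -> {"met": bool, "why": str}


def screen_paper(paper_id: str, abstract: str, call_llm) -> ScreeningResult:
    """Ask an LLM to judge one abstract against all four criteria.

    `call_llm` is a hypothetical placeholder: it takes a prompt string and
    returns the model's text response (assumed here to be valid JSON).
    """
    prompt = PROMPT_TEMPLATE.format(
        abstract=abstract,
        criteria="\n".join(f"- {name}: {question}" for name, question in CRITERIA.items()),
    )
    raw = call_llm(prompt)
    return ScreeningResult(paper_id=paper_id, verdicts=json.loads(raw))


def meets_all_criteria(result: ScreeningResult) -> bool:
    """A benchmark 'passes' only if every one of the four criteria is judged as met."""
    return all(verdict["met"] for verdict in result.verdicts.values())
```

In the study's workflow, such per-paper verdicts would be validated against expert judgements and then aggregated across the corpus to check whether any benchmark satisfies all four criteria at once.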
Problem

Research questions and friction points this paper is trying to address.

public sector
LLM agents
benchmarks
evaluation criteria
governance requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

public-sector benchmarks
LLM agents
process-based evaluation
realistic assessment
public administration
Jonathan Rystrøm
Oxford Internet Institute, University of Oxford, Oxford, United Kingdom

Chris Schmitz
PhD Student, Centre for Digital Governance, Hertie School

Karolina Korgul
Oxford Internet Institute, University of Oxford
AI Safety, AI Agents, Evals

Jan Batzner
Weizenbaum Institute, Berlin, Germany; Technical University of Munich, Munich, Germany

Chris Russell
Associate Professor, University of Oxford
Ethical Machine Learning, Computer Vision, Optimisation, Ethical AI