🤖 AI Summary
Existing agent benchmarks fall short of meeting the stringent legal, procedural, and structural requirements of the public sector. This study addresses this gap by proposing, for the first time, a framework for public-sector-oriented agent evaluation grounded in first principles of public administration. The proposed framework emphasizes four essential criteria: procedural fidelity, real-world relevance, public-sector specificity, and targeted metrics. To operationalize it, the authors develop a scalable, automated analysis pipeline augmented by large language models and validated through expert consultation. Applying this approach to a systematic review of over 1,300 benchmarking studies reveals that no existing benchmark fully satisfies all four criteria. This work thus provides a theoretical foundation, a methodological toolkit, and clear pathways for advancing the evaluation of AI agents in public-sector contexts.
📝 Abstract
Deploying Large Language Model-based agents (LLM agents) in the public sector requires assurance that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of public administration literature: benchmarks must be process-based, realistic, and public-sector-specific, and must report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers against these criteria using an expert-validated, LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings provide a call to action both for researchers to develop public-sector-relevant benchmarks and for public-sector officials to apply these criteria when evaluating their own agentic use cases.
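The abstract describes an expert-validated, LLM-assisted pipeline for screening benchmark papers against the four criteria, but the paper's actual implementation is not reproduced here. The sketch below is purely illustrative of how such a screening loop could look: the criterion wording, the `call_llm` helper, the JSON answer format, and the `Screening` structure are all assumptions, not the authors' code.

```python
import json
from dataclasses import dataclass

# Four screening questions paraphrasing the paper's criteria
# (the authors' exact rubric is not published in the abstract).
CRITERIA = {
    "process_based": "Does the benchmark evaluate the process an agent follows, not just final outcomes?",
    "realistic": "Are the tasks drawn from, or closely modelled on, real-world settings?",
    "public_sector_specific": "Are the tasks specific to public-sector institutions and procedures?",
    "targeted_metrics": "Do the reported metrics reflect public-sector requirements (e.g. legality, procedure)?",
}

PROMPT_TEMPLATE = (
    "You are screening a benchmark paper.\n"
    "Question: {question}\n"
    "Paper abstract:\n{abstract}\n"
    'Answer with JSON: {{"answer": "yes" or "no", "evidence": "<short quote>"}}'
)


@dataclass
class Screening:
    paper_id: str
    labels: dict  # criterion name -> bool


def call_llm(prompt: str) -> str:
    """Placeholder for whichever LLM API the pipeline uses (hypothetical)."""
    raise NotImplementedError


def screen_paper(paper_id: str, abstract: str) -> Screening:
    """Ask the LLM one yes/no question per criterion and collect the labels."""
    labels = {}
    for name, question in CRITERIA.items():
        raw = call_llm(PROMPT_TEMPLATE.format(question=question, abstract=abstract))
        labels[name] = json.loads(raw).get("answer") == "yes"
    return Screening(paper_id, labels)


def meets_all_criteria(screening: Screening) -> bool:
    """A benchmark would need a 'yes' on every criterion to fully satisfy the framework."""
    return all(screening.labels.values())
```

In a setup like this, the expert validation mentioned in the abstract would correspond to comparing the pipeline's labels on a sample of papers against human annotations before running it over the full corpus of 1,300+ studies.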