🤖 AI Summary
Existing agent benchmarks fall short of meeting the stringent legal, procedural, and structural requirements of the public sector. This study addresses this gap by proposing, for the first time, a framework for public-sector-oriented agent evaluation grounded in first principles of public administration. The proposed framework emphasizes four essential criteria: procedural fidelity, real-world relevance, public-sector specificity, and targeted metrics. To operationalize it, the authors develop a scalable, automated analysis pipeline augmented by large language models and validated through expert consultation. Applying this approach to a systematic review of over 1,300 benchmarking studies reveals that no existing benchmark fully satisfies all four criteria. This work thus provides a theoretical foundation, a methodological toolkit, and clear pathways for advancing the evaluation of AI agents in public-sector contexts.
📝 Abstract
Deploying Large Language Model-based agents (LLM agents) in the public sector requires assurance that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of public administration literature: benchmarks must be process-based, realistic, and public-sector-specific, and must report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers against these criteria using an expert-validated, LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings provide a call to action both for researchers to develop public-sector-relevant benchmarks and for public-sector officials to apply these criteria when evaluating their own agentic use cases.
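The abstract describes an expert-validated, LLM-assisted pipeline for screening benchmark papers against the four criteria, but the paper's actual implementation is not reproduced here. The sketch below is purely illustrative of how such a screening loop could look: the criterion wording, the `call_llm` helper, the JSON answer format, and the `Screening` structure are all assumptions, not the authors' code.

```python
import json
from dataclasses import dataclass

# Four screening questions paraphrasing the paper's criteria
# (the authors' exact rubric is not published in the abstract).
CRITERIA = {
    "process_based": "Does the benchmark evaluate the process an agent follows, not just final outcomes?",
    "realistic": "Are the tasks drawn from, or closely modelled on, real-world settings?",
    "public_sector_specific": "Are the tasks specific to public-sector institutions and procedures?",
    "targeted_metrics": "Do the reported metrics reflect public-sector requirements (e.g. legality, procedure)?",
}

PROMPT_TEMPLATE = (
    "You are screening a benchmark paper.\n"
    "Question: {question}\n"
    "Paper abstract:\n{abstract}\n"
    'Answer with JSON: {{"answer": "yes" or "no", "evidence": "<short quote>"}}'
)


@dataclass
class Screening:
    paper_id: str
    labels: dict  # criterion name -> bool


def call_llm(prompt: str) -> str:
    """Placeholder for whichever LLM API the pipeline uses (hypothetical)."""
    raise NotImplementedError


def screen_paper(paper_id: str, abstract: str) -> Screening:
    """Ask the LLM one yes/no question per criterion and collect the labels."""
    labels = {}
    for name, question in CRITERIA.items():
        raw = call_llm(PROMPT_TEMPLATE.format(question=question, abstract=abstract))
        labels[name] = json.loads(raw).get("answer") == "yes"
    return Screening(paper_id, labels)


def meets_all_criteria(screening: Screening) -> bool:
    """A benchmark would need a 'yes' on every criterion to fully satisfy the framework."""
    return all(screening.labels.values())
```

In a setup like this, the expert validation mentioned in the abstract would correspond to comparing the pipeline's labels on a sample of papers against human annotations before running it over the full corpus of 1,300+ studies.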