Measuring Agents in Production

📅 2025-12-02
🤖 AI Summary
This study addresses the lack of systematic understanding of AI agent deployment in production environments. We conduct the largest empirical investigation to date, surveying 306 practitioners and carrying out 20 in-depth case studies across 26 domains, using a mixed-methods approach that combines quantitative surveys with qualitative interviews. For the first time, we empirically characterize the real-world technical choices, development paradigms, and evaluation practices of production-grade agents. Key findings include: 68% of production agents execute at most 10 steps before requiring human intervention; 70% rely on prompt engineering rather than model fine-tuning; and 74% depend primarily on human evaluation. Simpler, more controllable methods dominate practice, yet reliability remains the foremost bottleneck. The study bridges the gap between academic research and industrial practice by providing the first large-scale, evidence-based map of production practices and an accompanying framework of open challenges for agent engineering.

📝 Abstract
AI agents are actively running in production across diverse industries, yet little is publicly known about which technical approaches enable successful real-world deployments. We present the first large-scale systematic study of AI agents in production, surveying 306 practitioners and conducting 20 in-depth case studies via interviews across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and what the top development challenges are. We find that production agents are typically built using simple, controllable approaches: 68% execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability remains the top development challenge, driven by difficulties in ensuring and evaluating agent correctness. Despite these challenges, simple yet effective methods already enable agents to deliver impact across diverse industries. Our study documents the current state of practice and bridges the gap between research and deployment by providing researchers visibility into production challenges while offering practitioners proven patterns from successful deployments.
Problem

Research questions and friction points this paper is trying to address.

Investigates technical approaches enabling real-world AI agent deployments.
Examines why organizations build agents, how they build them, and how they evaluate them.
Identifies reliability as the top development challenge for AI agents.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simple, controllable AI agent approaches
Prompting off-the-shelf models without tuning
Human evaluation and human-in-the-loop intervention for reliability (see the sketch below)
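
To make this pattern concrete: the paper's headline numbers describe a prompted, off-the-shelf model running in a short, bounded loop that hands control to a person once its step budget is exhausted. The following is a minimal illustrative sketch, not code from the paper; run_agent, call_model, tools, and ask_human are hypothetical placeholders.

```python
MAX_STEPS = 10  # 68% of surveyed production agents execute at most 10 steps


def run_agent(task, call_model, tools, ask_human):
    """Step-capped, human-in-the-loop agent loop (illustrative sketch).

    call_model: callable sending a prompt string to an off-the-shelf LLM
    tools:      dict mapping tool names to callables
    ask_human:  callable invoked when the step budget runs out
    """
    history = [f"Task: {task}"]
    for _ in range(MAX_STEPS):
        # Prompt engineering rather than weight tuning: behavior is
        # steered entirely by instructions in the context window.
        action = call_model(
            "Reply 'FINAL: <answer>' when done, or "
            "'TOOL: <name> <input>' to call a tool.\n" + "\n".join(history)
        )
        if action.startswith("FINAL:"):
            return action[len("FINAL:"):].strip()
        if action.startswith("TOOL:"):
            parts = action.split(maxsplit=2)
            name = parts[1] if len(parts) > 1 else ""
            arg = parts[2] if len(parts) > 2 else ""
            result = tools[name](arg) if name in tools else f"unknown tool: {name}"
            history.append(f"{action}\nObservation: {result}")
        else:
            history.append(f"Model: {action}")
    # Step budget exhausted: escalate to a person instead of guessing.
    return ask_human(task, history)
```

The hard step cap and the human fallback are the load-bearing choices here: per the survey, most deployed agents favor this kind of short, inspectable control flow with human oversight over long autonomous runs.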
Authors

Melissa Z. Pan
UC Berkeley

Negar Arabzadeh
UC Berkeley
Information retrieval, Natural Language Processing, Evaluation

Riccardo Cogo
Intesa Sanpaolo

Yuxuan Zhu
PhD student, University of Illinois Urbana-Champaign
Data systems, AI evaluation

Alexander Xiong
UC Berkeley

Lakshya A Agrawal
University of California, Berkeley
Large Language Models, AI4Code, Artificial Intelligence, Programming Languages, Software Engineering

Huanzhi Mao
UC Berkeley

Emma Shen
UC Berkeley

Sid Pallerla
UC Berkeley

Liana Patel
Stanford University

Shu Liu
UC Berkeley

Tianneng Shi
UC Berkeley

Xiaoyuan Liu
UC Berkeley

Jared Quincy Davis
Foundry | Stanford University
machine learning, deep learning, reinforcement learning, systems

Emmanuele Lacavalla
Intesa Sanpaolo

Alessandro Basile
Intesa Sanpaolo

Shuyi Yang
Intesa Sanpaolo
Agentic AI, Semi-Supervised Learning, Fairness, Differential Privacy, NetSci

Paul Castro
Senior Research Manager and Scientist, IBM Research
Cloud Computing, Mobile Computing

Daniel Kang
UIUC
Computer Science

Joseph E. Gonzalez
Professor of Computer Science, UC Berkeley
Machine Learning, Computer Systems

Koushik Sen
Professor of Computer Science, University of California, Berkeley
Computer Science, Testing, Debugging, Program Analysis, Concurrency

Dawn Song
Professor of Computer Science, UC Berkeley
Computer Security and Privacy

Ion Stoica
Professor of Computer Science, UC Berkeley
Cloud Computing, Networking, Distributed Systems, Big Data

Matei Zaharia
UC Berkeley and Databricks
Distributed Systems, Machine Learning, Databases, Security

Marquita Ellis
IBM Research