An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

📅 2025-09-23

🤖 AI Summary

Foundation model (FM)-driven AI agents are inherently non-deterministic and hard to reproduce, which makes systematic testing difficult. Method: We conduct a large-scale empirical study of testing practices across 39 open-source agent frameworks and 439 agentic applications. Contribution/Results: We identify ten distinct testing patterns and uncover an inversion of testing effort: deterministic components (e.g., tools and workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5% and prompts (the Trigger component) appear in around 1% of all tests. We provide the first empirical testing baseline for FM-based agent frameworks and agentic applications, and find that agent-specific methods such as DeepEval are seldom used (around 1%), whereas traditional patterns such as negative and membership testing are widely adapted to manage FM uncertainty. Current practice is a rational but incomplete adaptation to non-determinism. We argue for better framework-level support for novel testing methods and for prompt regression testing in applications to close this gap.

📝 Abstract

Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests. Our findings offer the first empirical testing baseline in FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To address it, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
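
The abstract names membership and negative testing as traditional patterns adapted to FM uncertainty. As a rough illustration, the sketch below reads them in their conventional sense (the paper's exact definitions may differ): a membership test accepts any output from an allowed set instead of one exact value, and a negative test checks that invalid input is rejected. The function `classify_intent`, the set `ALLOWED_INTENTS`, and the random stub standing in for an FM call are all hypothetical and not taken from the paper.

```python
import random

# Hypothetical set of valid outputs for an intent-classification agent step.
ALLOWED_INTENTS = {"refund", "cancel", "escalate"}


def classify_intent(message: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an FM call; the result varies between runs."""
    if not message.strip():
        raise ValueError("message must not be empty")
    # A real agent would call a model here; the stub picks a label at random.
    return rng.choice(sorted(ALLOWED_INTENTS))


def check_membership(runs: int = 20) -> bool:
    """Membership test: every output must belong to the allowed set."""
    rng = random.Random()  # deliberately unseeded to mimic non-determinism
    return all(
        classify_intent("I want my money back", rng) in ALLOWED_INTENTS
        for _ in range(runs)
    )


def check_negative() -> bool:
    """Negative test: blank input must be rejected with ValueError."""
    try:
        classify_intent("   ", random.Random())
    except ValueError:
        return True
    return False
```

Both checks pass regardless of which label the stub returns, which is the point of the adaptation: the assertion constrains the space of acceptable outputs without pinning down a single one.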

Problem

Research questions and friction points this paper is trying to address.

Testing challenges in AI agents due to non-determinism and non-reproducibility

Limited understanding of internal correctness verification during agent development

Critical blind spot in testing effort distribution across agent components
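
The blind spot concerns prompts (the Trigger component), for which the abstract recommends prompt regression testing. One possible form, sketched here under assumptions, compares the rendered prompt against an approved golden copy so that accidental edits to the template fail a test. The template text, `GOLDEN_PROMPT`, and the helper names are hypothetical; the paper does not prescribe this implementation.

```python
from string import Template

# Hypothetical prompt template used by an agentic application.
PROMPT_TEMPLATE = Template(
    "You are a support agent. Answer in at most $max_sentences sentences.\n"
    "Question: $question"
)

# Hypothetical golden copy, approved by a reviewer and kept under version control.
GOLDEN_PROMPT = (
    "You are a support agent. Answer in at most 3 sentences.\n"
    "Question: Where is my order?"
)


def render_prompt(question: str, max_sentences: int = 3) -> str:
    """Fill the template with concrete values."""
    return PROMPT_TEMPLATE.substitute(
        question=question, max_sentences=max_sentences
    )


def check_prompt_regression() -> bool:
    """Return True when the rendered prompt still matches the golden copy."""
    return render_prompt("Where is my order?") == GOLDEN_PROMPT
```

This only guards the prompt text itself. Checking how the model responds to a changed prompt would need a separate, model-aware test.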

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conducts large-scale empirical study of testing practices

Identifies ten distinct testing patterns in agent frameworks

Reveals inversion of testing effort favoring deterministic components

🔎 Similar Papers

AI-powered test automation tools: A systematic review and empirical evaluation