🤖 AI Summary
Evaluating LLM-based agents is challenging because they are dynamic, probabilistic, and continuously evolving; traditional predefined benchmarks fail to capture their open-ended behaviors, emergent outcomes, and lifecycle adaptation.
Method: We propose an evaluation-driven agent development paradigm that integrates online (runtime) evaluation with offline (redevelopment) evaluation. This hybrid framework enables real-time feedback injection, human-AI collaborative closed-loop refinement, and iterative optimization across the full stack (pipeline, architecture, and underlying LLM), incorporating both human and AI evaluators.
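To make the hybrid loop concrete, here is a minimal Python sketch of how online runtime evaluation with feedback injection might feed an offline redevelopment pass. All names (`EvalResult`, `TraceStore`, `online_loop`, `offline_loop`) are illustrative assumptions, not interfaces from the paper:

```python
# Minimal conceptual sketch of the hybrid online/offline evaluation loop.
# All names here are hypothetical illustrations, not the paper's actual API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalResult:
    score: float   # scalar quality/safety score from an evaluator
    feedback: str  # fine-grained, actionable feedback

@dataclass
class TraceStore:
    """Logs every interaction so offline (redevelopment) evaluation can replay it."""
    traces: list = field(default_factory=list)

    def log(self, task: str, output: str, result: EvalResult) -> None:
        self.traces.append((task, output, result))

def online_loop(agent: Callable[[str, str], str],
                evaluate: Callable[[str, str], EvalResult],
                tasks: list[str],
                store: TraceStore,
                threshold: float = 0.7) -> None:
    """Runtime evaluation: score each output, injecting feedback on failure."""
    for task in tasks:
        feedback = ""
        for _ in range(3):                 # bounded retry budget
            output = agent(task, feedback)
            result = evaluate(task, output)
            store.log(task, output, result)
            if result.score >= threshold:  # acceptable: move to the next task
                break
            feedback = result.feedback     # real-time feedback injection

def offline_loop(store: TraceStore,
                 reevaluate: Callable[[str, str], EvalResult],
                 threshold: float = 0.7) -> list[tuple[str, str]]:
    """Redevelopment evaluation: replay logged traces against updated criteria
    (new requirements or regulations) to surface what needs refinement."""
    return [(task, output) for task, output, _ in store.traces
            if reevaluate(task, output).score < threshold]
```

In this reading, the online pass keeps the running agent within bounds, while the offline pass drives iterative refinement of pipelines, architecture, and the LLM itself.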
Contribution/Results: We introduce the first evaluation-centric process model and reference architecture for LLM agent development. It uniquely supports open-behavior capture, emergent-result governance, and dynamic alignment. Experiments demonstrate that the framework enables safe, controllable, and continuous agent iteration under objective drift, requirement changes, and regulatory evolution, achieving robust adaptability without compromising reliability or compliance.
📝 Abstract
Large Language Models (LLMs) have enabled the emergence of LLM agents: autonomous systems capable of achieving under-specified goals and adapting post-deployment, often without explicit code or model changes. Evaluating these agents is critical to ensuring their performance and safety, especially given their dynamic, probabilistic, and evolving nature. However, traditional approaches such as predefined test cases and standard redevelopment pipelines struggle to address the unique challenges of LLM agent evaluation. These challenges include capturing open-ended behaviors, handling emergent outcomes, and enabling continuous adaptation over the agent's lifecycle. To address these issues, we propose an evaluation-driven development approach, inspired by test-driven and behavior-driven development but reimagined for the unique characteristics of LLM agents. Through a multivocal literature review (MLR), we synthesize the limitations of existing LLM evaluation methods and introduce a novel process model and reference architecture tailored for evaluation-driven development of LLM agents. Our approach integrates online (runtime) and offline (redevelopment) evaluations, enabling adaptive runtime adjustments and systematic iterative refinement of pipelines, artifacts, system architecture, and LLMs themselves. By continuously incorporating evaluation results, including fine-grained feedback from human and AI evaluators, into each stage of development and operation, this framework ensures that LLM agents remain aligned with evolving goals, user needs, and governance standards.
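As a closing illustration of the fine-grained feedback idea, the sketch below shows one plausible way to merge AI (LLM-as-judge) and human evaluator verdicts so that rationales, not just scores, reach each development stage. `Verdict` and `combine` are hypothetical names for this sketch, not the paper's API:

```python
# Illustrative sketch: combining AI (LLM-as-judge) and human evaluator
# feedback into a single record that downstream refinement stages consume.
# All names here are hypothetical, not interfaces from the paper.
from dataclasses import dataclass

@dataclass
class Verdict:
    source: str     # "ai" or "human"
    score: float    # 0.0 - 1.0
    rationale: str  # fine-grained, actionable feedback

def combine(verdicts: list[Verdict], human_weight: float = 2.0) -> dict:
    """Weight human verdicts more heavily than AI ones and keep every
    rationale, so refinement targets concrete failure reasons rather
    than just an aggregate score."""
    total, weight_sum = 0.0, 0.0
    for v in verdicts:
        w = human_weight if v.source == "human" else 1.0
        total += w * v.score
        weight_sum += w
    return {
        "score": total / weight_sum if weight_sum else 0.0,
        "rationales": [f"[{v.source}] {v.rationale}" for v in verdicts],
    }

# Example: one AI judge and one human reviewer disagree on a response.
verdicts = [
    Verdict("ai", 0.9, "Answer is fluent and on-topic."),
    Verdict("human", 0.4, "Cites a regulation that was repealed in 2023."),
]
print(combine(verdicts))  # human rationale survives into the refinement loop
```

Weighting human verdicts more heavily is a design assumption of this sketch; the point is that textual rationales survive aggregation, which is what makes closed-loop refinement actionable.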