AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying LLM-driven multi-agent workflows in production introduces subtle, hard-to-detect errors, emergent behaviors, and systemic risks that evade conventional monitoring. Method: We propose the first production-oriented post-deployment monitoring and debugging evaluation framework for such systems. It features a multi-stage analytical pipeline integrating topic clustering and quantitative scoring; a novel dual-memory architecture—comprising episodic and semantic memory—to enable cross-episode continual learning; and expert-informed debugging logic to uncover deep-seated flaws often missed by human annotation. Contribution/Results: Evaluated on real-world deployments and the TRAIL benchmark, our framework significantly enhances observability and reliability. It achieves state-of-the-art performance across key metrics—including anomaly detection precision, root-cause localization accuracy, and workflow stability—demonstrating robustness under dynamic, open-ended operational conditions.

📝 Abstract
With the growing adoption of Large Language Models (LLMs) in automating complex, multi-agent workflows, organizations face mounting risks from errors, emergent behaviors, and systemic failures that current evaluation methods fail to capture. We present AgentCompass, the first evaluation framework designed specifically for post-deployment monitoring and debugging of agentic workflows. AgentCompass models the reasoning process of expert debuggers through a structured, multi-stage analytical pipeline: error identification and categorization, thematic clustering, quantitative scoring, and strategic summarization. The framework is further enhanced with a dual memory system (episodic and semantic) that enables continual learning across executions. Through collaborations with design partners, we demonstrate the framework's practical utility on real-world deployments, before establishing its efficacy against the publicly available TRAIL benchmark. AgentCompass achieves state-of-the-art results on key metrics, while uncovering critical issues missed in human annotations, underscoring its role as a robust, developer-centric tool for reliable monitoring and improvement of agentic systems in production.
Problem

Research questions and friction points this paper is trying to address.

Detecting errors in multi-agent workflows post-deployment
Monitoring emergent behaviors in automated LLM systems
Debugging systemic failures in production agent workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured multi-stage analytical pipeline for debugging
Dual memory system enabling continual learning
Developer-centric framework for production monitoring
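The staged pipeline and dual-memory design summarized above can be illustrated with a minimal sketch. All class and field names here (`AgentTraceAnalyzer`, `episodic_memory`, `semantic_memory`, the trace schema) are hypothetical illustrations, not taken from the paper's implementation.

```python
from collections import defaultdict

class AgentTraceAnalyzer:
    """Hypothetical sketch of the staged pipeline described above:
    error identification -> thematic clustering -> scoring -> summary,
    with episodic (per-run) and semantic (cross-run) memory."""

    def __init__(self):
        self.episodic_memory = []                 # raw findings per execution
        self.semantic_memory = defaultdict(int)   # aggregated error themes across runs

    def identify_errors(self, trace):
        # Stage 1: flag workflow steps that did not complete cleanly
        return [step for step in trace if step.get("status") != "ok"]

    def cluster(self, errors):
        # Stage 2: group flagged steps by their error-category label
        clusters = defaultdict(list)
        for err in errors:
            clusters[err.get("category", "unknown")].append(err)
        return clusters

    def score(self, clusters, trace):
        # Stage 3: crude stability metric = fraction of steps that succeeded
        failed = sum(len(v) for v in clusters.values())
        return 1.0 - failed / max(len(trace), 1)

    def analyze(self, trace):
        errors = self.identify_errors(trace)
        clusters = self.cluster(errors)
        stability = self.score(clusters, trace)
        # Stage 4: summarize, and update both memories for continual learning
        self.episodic_memory.append({"errors": errors, "stability": stability})
        for theme, members in clusters.items():
            self.semantic_memory[theme] += len(members)
        return {"stability": stability, "themes": dict(self.semantic_memory)}

# Example: a three-step trace with one failing tool call
trace = [
    {"agent": "planner", "status": "ok"},
    {"agent": "tool_caller", "status": "error", "category": "api_timeout"},
    {"agent": "verifier", "status": "ok"},
]
report = AgentTraceAnalyzer().analyze(trace)
```

The real framework uses LLM-driven analysis at each stage rather than rule-based checks; this sketch only shows how per-run (episodic) findings can feed a cross-run (semantic) store of recurring error themes.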
NVJK Kartik, FutureAGI Inc.
Garvit Sapra, FutureAGI Inc.
Rishav Hada, Microsoft Research
Nikhil Pareek, FutureAGI Inc.

Topics: Natural Language Processing, Machine Learning, Artificial Intelligence