PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Debugging multi-agent large language model (LLM) workflows is difficult because errors propagate downstream and faults at intermediate nodes are hard to localize. This work proposes PROTEA, an interface for offline, test-driven improvement: it executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. Its key mechanism is backward node evaluation, which generates candidate node-level expectations from final-answer references and graph context, then compares them with the observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons and automatically reruns and re-evaluates the workflow. In two production-adjacent workflows, PROTEA improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph-level localization, per-node rationales, and editable before/after prompt revisions.
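For readers unfamiliar with the Hit@5 metric quoted above, the following is a minimal sketch under the common definition: the fraction of queries for which at least one relevant item appears among the top k ranked results. The paper's exact definition is not given here, so this reading is an assumption, and the function and variable names are hypothetical.

```python
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[Sequence[str]],
             relevant: Sequence[Set[str]],
             k: int = 5) -> float:
    """Fraction of queries with at least one relevant item in the top k.

    Assumption: this is the common Hit@k definition, not necessarily
    the exact one used in the paper.
    """
    if not ranked:
        return 0.0
    hits = sum(
        1
        for items, rel in zip(ranked, relevant)
        if any(item in rel for item in items[:k])
    )
    return hits / len(ranked)


# Toy data: two queries, one of which has a relevant item in its ranking.
ranked_lists = [["a", "b", "c"], ["x", "y", "z"]]
relevant_sets = [{"c"}, {"q"}]
print(hit_at_k(ranked_lists, relevant_sets, k=5))  # 0.5
```

Under this reading, a move from 0.30 to 0.38 means that eight more queries out of every hundred had a relevant item in their top five recommendations.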

📝 Abstract

Multi-agent LLM workflows, systems composed of multiple role-specific LLM calls, often outperform single-prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test-driven improvement of multi-agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final-answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context, then compares them with observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons, then automatically reruns and re-evaluates the workflow to show output changes and score trajectories within the same interface. In two production-adjacent workflows, PROTEA improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph-level localization, per-node rationales, and editable before/after prompt revisions.
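The score-then-localize loop described in the abstract can be illustrated with a small sketch. It is not PROTEA's implementation: the rubric type, the 0.5 threshold, the node names, and the "most upstream failing node" heuristic are all assumptions chosen for illustration. The workflow graph is represented as a mapping from each node to its upstream dependencies.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Optional, Set

# Hypothetical type: a rubric maps a node's output text to a score in [0, 1].
Rubric = Callable[[str], float]


def score_nodes(outputs: Dict[str, str],
                rubrics: Dict[str, Rubric]) -> Dict[str, float]:
    """Score every node that has both a recorded output and a rubric."""
    return {n: rubrics[n](outputs[n]) for n in outputs if n in rubrics}


def first_bottleneck(deps: Dict[str, Set[str]],
                     scores: Dict[str, float],
                     threshold: float = 0.5) -> Optional[str]:
    """Return the most upstream node scoring below threshold, or None.

    Assumption: an upstream failure is the likeliest root cause, so nodes
    are visited in topological order (dependencies first).
    """
    for node in TopologicalSorter(deps).static_order():
        if node in scores and scores[node] < threshold:
            return node
    return None


# Toy three-node chain: extract -> check -> report (names are hypothetical).
deps = {"extract": set(), "check": {"extract"}, "report": {"check"}}
outputs = {"extract": "invoice total: 120", "check": "", "report": "looks fine"}


def non_empty(text: str) -> float:
    """Trivial rubric: 1.0 if the node produced any text, else 0.0."""
    return 1.0 if text.strip() else 0.0


scores = score_nodes(outputs, {name: non_empty for name in deps})
print(first_bottleneck(deps, scores))  # check
```

Because the evaluation runs over recorded outputs, it can be repeated offline with different rubrics without re-executing the workflow.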

Problem

Research questions and friction points this paper is trying to address.

multi-agent LLM workflows

offline evaluation

debugging

iterative refinement

error propagation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent LLM workflows

offline evaluation

backward node evaluation
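Backward node evaluation, as the abstract describes it, derives node-level expectations from a final-answer reference and compares them with what each node actually produced. The sketch below shows only the shape of that comparison. The expectation generator is a hypothetical callable standing in for whatever model PROTEA uses, and the token-overlap score is a crude placeholder for its actual comparison, which is not specified here.

```python
from typing import Callable, Dict

# Hypothetical signature: (node name, final-answer reference) -> expected output.
ExpectFn = Callable[[str, str], str]


def token_overlap(expected: str, observed: str) -> float:
    """Jaccard overlap of lowercase tokens; a stand-in for a real judge."""
    a, b = set(expected.lower().split()), set(observed.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def backward_evaluate(observed: Dict[str, str],
                      reference: str,
                      expect_fn: ExpectFn) -> Dict[str, float]:
    """Compare each observed node output with its generated expectation."""
    return {
        node: token_overlap(expect_fn(node, reference), output)
        for node, output in observed.items()
    }


def toy_expect(node: str, reference: str) -> str:
    """Hand-written expectations for a toy workflow (illustrative only)."""
    return {"extract": "total 120 EUR", "decide": reference}[node]


observed_outputs = {"extract": "total 120 EUR", "decide": "reject"}
node_scores = backward_evaluate(observed_outputs, "approve", toy_expect)
print(node_scores)  # {'extract': 1.0, 'decide': 0.0}
```

In this toy run the extraction node matches its expectation while the decision node does not, which would point a developer at the decision prompt as the one to revise.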