🤖 AI Summary
Existing evaluations of multi-agent workflows provide limited evidence that the declared coordination structure (tool-access rules, restrictions, and delegation paths) is actually exercised, which makes it hard to assess test-suite adequacy or detect structural regressions. This work proposes a structural testing approach for multi-agent workflow specifications: it represents each workflow as a typed coordination graph, derives coverage obligations over reachable agents, allowed-tool edges, restricted-tool edges, and delegation edges, and uses coverage-driven generation with DSPy-based scenario realization to produce executable natural-language test scenarios whose witnesses are checked at runtime. On ten benchmarks derived from OpenAI Agents SDK-style workflows, generated scenarios witness 54 of 75 allowed-tool obligations and 36 of 48 delegation obligations within a bounded refinement budget, while the adversarial restricted-tool criterion elicits 23/248 restricted-call violations, separating workflows whose restrictions hold under probing from those with concrete misrouting failures.
📝 Abstract
Multi-agent systems increasingly expose explicit workflow structure: agents, tools, tool-access rules, restrictions, and delegation paths. Existing evaluations rely largely on end-to-end task success, benchmark scores, final-response quality, or prompt-level checks, which provide limited evidence that this declared coordination structure has actually been exercised. This makes it difficult to assess test-suite adequacy or detect structural regressions in tool access, restrictions, and inter-agent delegation. We address this gap with a structural testing approach for multi-agent workflow specifications. The approach represents each workflow as a typed coordination graph, derives coverage obligations over reachable agents, allowed tool edges, restricted tool edges, and delegation edges, and uses coverage-driven generation with DSPy-based scenario realization to produce executable tests. The graph fixes what must be covered; DSPy realizes those obligations as natural-language scenarios whose witnesses are checked at runtime. We implement the approach for OpenAI Agents SDK-style workflows and evaluate it on ten SDK-derived benchmarks comprising 49 reachable agents, 47 tools, and 403 structural obligations. Generated scenarios witness 54/75 allowed-tool obligations and 36/48 delegation obligations within a bounded refinement budget. The adversarial restricted-tool criterion elicits 23/248 restricted-call violations, separating workflows whose restrictions hold under probing from workflows with concrete misrouting failures. These results show that structural coverage provides a useful adequacy layer for multi-agent workflow testing: it does not replace semantic or end-to-end evaluation, but reveals whether declared agents, tool-access rules, restrictions, and delegation paths have been exercised.
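To make the graph-to-obligation step concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a workflow can be described by three adjacency maps (allowed tools, restricted tools, delegations) plus an entry agent, and that a runtime trace is a list of `(event, source, target)` tuples. The names `Workflow`, `Obligation`, `Kind`, `check_trace`, the example agents and tools, and the trace format are all hypothetical and do not correspond to the OpenAI Agents SDK or DSPy APIs; scenario generation is omitted entirely.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    AGENT = "agent"                      # a reachable agent should be activated
    ALLOWED_TOOL = "allowed_tool"        # agent -> tool edge that should be exercised
    RESTRICTED_TOOL = "restricted_tool"  # agent -> tool edge that must not fire
    DELEGATION = "delegation"            # agent -> agent handoff edge


@dataclass(frozen=True)
class Obligation:
    kind: Kind
    source: str
    target: str = ""  # empty for AGENT obligations


@dataclass
class Workflow:
    """Hypothetical typed coordination graph: agents are keys, edges are typed maps."""
    entry: str
    allowed: dict[str, set[str]] = field(default_factory=dict)
    restricted: dict[str, set[str]] = field(default_factory=dict)
    delegates: dict[str, set[str]] = field(default_factory=dict)

    def reachable_agents(self) -> list[str]:
        # Depth-first walk over delegation edges, starting at the entry agent.
        seen: list[str] = []
        stack = [self.entry]
        while stack:
            agent = stack.pop()
            if agent in seen:
                continue
            seen.append(agent)
            stack.extend(sorted(self.delegates.get(agent, ()), reverse=True))
        return seen

    def obligations(self) -> list[Obligation]:
        # One obligation per reachable agent and per typed edge leaving it.
        obs: list[Obligation] = []
        for agent in self.reachable_agents():
            obs.append(Obligation(Kind.AGENT, agent))
            for kind, edges in (
                (Kind.ALLOWED_TOOL, self.allowed),
                (Kind.RESTRICTED_TOOL, self.restricted),
                (Kind.DELEGATION, self.delegates),
            ):
                obs += [Obligation(kind, agent, t) for t in sorted(edges.get(agent, ()))]
        return obs


# Which (assumed) trace event witnesses each kind of obligation.
EVENT_FOR = {
    Kind.AGENT: "agent",
    Kind.ALLOWED_TOOL: "tool",
    Kind.RESTRICTED_TOOL: "tool",
    Kind.DELEGATION: "handoff",
}


def check_trace(obs, trace):
    """Split witnessed obligations into covered ones and restricted-call violations."""
    events = set(trace)
    covered, violations = set(), set()
    for ob in obs:
        if (EVENT_FOR[ob.kind], ob.source, ob.target) in events:
            (violations if ob.kind is Kind.RESTRICTED_TOOL else covered).add(ob)
    return covered, violations


# Illustrative workflow: "archive" is declared but unreachable from "triage".
wf = Workflow(
    entry="triage",
    allowed={"billing": {"lookup_order", "refund"},
             "support": {"lookup_order"},
             "archive": {"export"}},
    restricted={"support": {"refund"}},
    delegates={"triage": {"billing", "support"}},
)
obs = wf.obligations()

# Illustrative trace in which "support" calls a tool it is restricted from.
trace = [
    ("agent", "triage", ""),
    ("handoff", "triage", "support"),
    ("agent", "support", ""),
    ("tool", "support", "lookup_order"),
    ("tool", "support", "refund"),
]
covered, violations = check_trace(obs, trace)
print(f"{len(covered)}/{len(obs)} obligations witnessed, {len(violations)} violation(s)")
```

In this sketch a restricted-tool obligation is never counted as covered: a matching event is reported as a violation, which mirrors the distinction the abstract draws between witnessing allowed edges and eliciting restricted calls under adversarial probing. Unreachable agents such as `archive` contribute no obligations because the derivation only walks agents reachable from the entry point.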