PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses two problems in large language model (LLM)-based multi-agent orchestration: models struggle to decide what each sub-agent needs to know, and effective benchmarks for measuring that ability are lacking. The authors propose PerspectiveGap, a benchmark framed by the Prompt Economy principle and comprising 110 scenarios organized into ten topologies distilled from the authors' real-world engineering practice. Each scenario is evaluated through two task formats, role-fragment assignment and free-form prompt writing, combining distractor mixing with contextualized evaluation. Using pass rate and leakage rate as quantitative metrics, the benchmark assesses 27 commercial LLMs from 10 companies. Even the best-performing model, GPT-5.5, reaches only a 62.0% combined pass rate, and the average across all models is 14.9%. The average overall leakage rate is 246.5% (49.1% for GPT-5.5); this figure is a per-scenario count of information leak events rather than a proportion, which is why it can exceed 100%. Together, the results underscore how difficult the task remains.

📝 Abstract

Real-world LLM applications are moving beyond single-agent workflows toward orchestrated multi-agent systems, yet current models still struggle to determine what each sub-agent needs to know. To measure this, we introduce PerspectiveGap, a benchmark for evaluating LLMs' ability to compose orchestration prompts for multi-agent systems. PerspectiveGap contains 110 scenarios, each evaluated through two distractor-mixed task formats: role-fragment assignment and free-form prompt writing. These scenarios are organized into 10 topologies, which are distilled from the authors' real-world engineering practice and framed by the Prompt Economy principle: building loop-centered orchestrations that maximize utility with minimal role and engineering overhead. In experiments with 27 commercial models from 10 companies, GPT-5.5 substantially outperforms all competitors, whereas Opus 4.7 shows a notable weakness in orchestration prompting despite its strong coding performance. Nevertheless, PerspectiveGap remains challenging: the evaluated models achieve an average combined pass rate of only 14.9% (GPT-5.5 62.0%) and an average overall leakage rate of 246.5% (a per-scenario information leak-event count, not a proportion; GPT-5.5 49.1%). These findings suggest that multi-agent orchestration prompting is a distinct and under-evaluated capability, and PerspectiveGap provides a foundation for measuring and improving it systematically.
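The abstract notes that the leakage rate is a per-scenario leak-event count rather than a proportion, which is how it can exceed 100%. The sketch below illustrates one way such a figure could arise. It is not the paper's scoring code: the `ScenarioResult` record, its fields, and the toy data are all hypothetical, and the assumption that both metrics are simple averages over scenarios expressed as percentages is ours.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    """Hypothetical per-scenario outcome; not the benchmark's real schema."""
    passed: bool      # whether the orchestration prompt passed the scenario
    leak_events: int  # assumed: count of information leak events observed


def pass_rate(results: list[ScenarioResult]) -> float:
    """Percentage of scenarios passed; bounded by 0 and 100."""
    return 100.0 * sum(r.passed for r in results) / len(results)


def leakage_rate(results: list[ScenarioResult]) -> float:
    """Average leak events per scenario, as a percentage.

    Because one scenario can contribute several leak events, this value
    is not bounded by 100 (assumed interpretation of the abstract).
    """
    return 100.0 * sum(r.leak_events for r in results) / len(results)


# Toy data: four scenarios, one pass, ten leak events in total.
results = [
    ScenarioResult(passed=True, leak_events=0),
    ScenarioResult(passed=False, leak_events=3),
    ScenarioResult(passed=False, leak_events=5),
    ScenarioResult(passed=False, leak_events=2),
]

print(pass_rate(results))     # 25.0
print(leakage_rate(results))  # 250.0
```

Under this reading, a leakage rate of 246.5% would correspond to roughly 2.5 leak events per scenario on average, while a pass rate stays a true proportion.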

Problem

Research questions and friction points this paper is trying to address.

multi-agent orchestration

prompting

PerspectiveGap

role assignment

information leakage

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent orchestration

Prompt Economy

PerspectiveGap