TeamBench: Evaluating Agent Coordination under Enforced Role Separation

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a gap in the evaluation of multi-agent teams: because existing benchmarks specify roles only through prompts, a team pass rate cannot distinguish genuine coordination from one role doing another role's work. The authors propose TeamBench, a benchmark that uses operating system-level access controls to split each task across three roles (Planner, Executor, and Verifier) and to separate specification access, workspace editing, and final certification among them. Using 851 task templates, 931 seeded instances, a deterministic grader, and a 40-session human study, the authors find that prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, yet prompt-only runs produce 3.6 times more cases in which the verifier attempts to edit the executor's code. Verifiers also approve 49% of submissions that fail the deterministic grader. Team value is conditional: teams help when single agents struggle and hurt when single agents already perform well.

📝 Abstract

Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role's work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coordination under operating system-enforced role separation. TeamBench separates specification access, workspace editing, and final certification across Planner, Executor, and Verifier roles, so that no role can read the full requirements, modify the workspace, and certify the final answer. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the verifier attempts to edit the executor's code. Verifiers approve 49% of submissions that fail the deterministic grader, and removing the verifier improves mean partial score in the ablation. Team value is also conditional. Teams benefit when single agents struggle, but hurt when single agents already perform well. A 40-session human study under the same role separation shows that our benchmark exposes interaction patterns that pass rate misses. Solo participants work through the task directly, human participants paired with agents often collapse into quick approval, and human teams spend more effort coordinating missing information across roles.
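The separation described in the abstract can be pictured as a small capability matrix. The sketch below is illustrative only and is not taken from TeamBench: the names `Capability`, `ROLE_CAPS`, `is_allowed`, `separation_holds`, and `count_violations` are hypothetical, the one-capability-per-role mapping is an assumption read off the abstract, and the check runs in-process rather than through the operating system-level enforcement the benchmark relies on.

```python
from enum import Flag, auto


class Capability(Flag):
    """The three privileges the abstract says are split across roles."""
    READ_SPEC = auto()       # read the full requirements
    EDIT_WORKSPACE = auto()  # modify the workspace
    CERTIFY = auto()         # certify the final answer


ALL_CAPS = Capability.READ_SPEC | Capability.EDIT_WORKSPACE | Capability.CERTIFY

# Assumed mapping: one capability per role (hypothetical, for illustration).
ROLE_CAPS = {
    "planner": Capability.READ_SPEC,
    "executor": Capability.EDIT_WORKSPACE,
    "verifier": Capability.CERTIFY,
}


def is_allowed(role: str, action: Capability) -> bool:
    """Return True if the role holds the capability needed for the action."""
    return bool(ROLE_CAPS[role] & action)


def separation_holds(role_caps: dict) -> bool:
    """Check that no single role holds all three capabilities at once."""
    return all(caps != ALL_CAPS for caps in role_caps.values())


def count_violations(attempts: list) -> int:
    """Count attempted (role, action) pairs that fall outside the role's capabilities."""
    return sum(1 for role, action in attempts if not is_allowed(role, action))


if __name__ == "__main__":
    attempts = [
        ("planner", Capability.READ_SPEC),
        ("verifier", Capability.EDIT_WORKSPACE),  # verifier tries to edit executor's code
        ("executor", Capability.EDIT_WORKSPACE),
    ]
    print(separation_holds(ROLE_CAPS), count_violations(attempts))
```

Counting denied attempts rather than only blocking them mirrors the kind of measurement the abstract reports, where the interesting signal is how often a role reaches for a privilege it should not have.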

Problem

Research questions and friction points this paper is trying to address.

agent coordination

role separation

access control

team evaluation

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

role separation

agent coordination

access control