🤖 AI Summary
This work addresses a limitation of existing enterprise agent evaluations, which often overlook real-world constraints such as role specialization, access control, and state dependencies in collaborative workflows. To bridge this gap, we introduce EntCollabBench, a benchmark for evaluating multi-agent collaboration in enterprise settings. It simulates a permission-isolated organization of six departments and eleven role-specialized agents, and assesses collaboration through two task subsets: workflow execution and policy approval. The benchmark combines role specialization, permission isolation, database state verification, and deterministic policy adjudication, and it scores LLM-based agents from their execution traces and the resulting state of a stateful database rather than from their natural-language responses. Our experiments show that current mainstream LLM agents still fall short in task delegation, context propagation, parameter grounding, workflow closure, and decision commitment. The benchmark thereby provides a reproducible evaluation platform for future research.
📝 Abstract
Large language model (LLM) agents are increasingly expected to operate in enterprise environments, where work is distributed across specialized roles, permission-controlled systems, and cross-departmental procedures. However, existing enterprise benchmarks largely evaluate single agents with broad tool access, while existing multi-agent benchmarks rarely capture realistic enterprise constraints such as role specialization, access control, stateful business systems, and policy-based approvals. We introduce EntCollabBench, a benchmark for evaluating enterprise multi-agent collaboration. EntCollabBench simulates a permission-isolated organization with 11 role-specialized agents across six departments and contains two evaluation subsets: a Workflow subset, where agents collaboratively modify enterprise system states, and an Approval subset, where agents make policy-grounded decisions. Evaluation is based on execution traces, database state verification, and deterministic policy adjudication rather than natural-language response judging. Experiments with representative LLM agents show that current models still struggle with end-to-end enterprise collaboration, especially in delegation, context transfer, parameter grounding, workflow closure, and decision commitment. EntCollabBench provides a reproducible testbed for measuring and improving agent systems intended for realistic organizational environments.
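The three evaluation ingredients named in the abstract (permission isolation, database state verification, and deterministic policy adjudication) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the role names, tool names, expense policy, and function names are hypothetical and are not taken from the benchmark itself.

```python
from dataclasses import dataclass

# Hypothetical role-to-tool permission map; the names are illustrative only.
PERMISSIONS = {
    "hr_specialist": {"update_employee_record"},
    "finance_analyst": {"approve_expense"},
}


def can_call(role: str, tool: str) -> bool:
    """Permission isolation: a role may only invoke tools granted to it."""
    return tool in PERMISSIONS.get(role, set())


def verify_state(actual: dict, expected: dict) -> bool:
    """State verification: each expected field must match the final database row."""
    return all(actual.get(key) == value for key, value in expected.items())


@dataclass(frozen=True)
class ExpenseRequest:
    amount: float
    has_receipt: bool


def adjudicate(request: ExpenseRequest, limit: float = 500.0) -> str:
    """Hypothetical deterministic policy: the same request always yields the same decision."""
    if not request.has_receipt:
        return "reject"
    return "approve" if request.amount <= limit else "escalate"


def score_approval(agent_decision: str, request: ExpenseRequest) -> bool:
    """Score an agent by comparing its committed decision with the rule-based one."""
    return agent_decision == adjudicate(request)
```

In this sketch an agent is scored on what it did and decided, not on how it phrased its answer: a Workflow-style task would pass only if `verify_state` holds on the final record, and an Approval-style task only if the committed decision equals the output of `adjudicate`.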