🤖 AI Summary
Existing agent benchmarks struggle to address the challenges posed by enterprise-grade GUI systems, such as high interface density, stringent business logic, and strict state consistency requirements. This work proposes EntWorld—the first comprehensive evaluation environment tailored for enterprise-level agents—encompassing 1,756 tasks across six domains including CRM, ITIL, and ERP. By reverse-engineering realistic, long-horizon workflows from database schemas and introducing an SQL-driven state validation mechanism, EntWorld enables deterministic evaluation without requiring human annotations or execution traces. Experimental results reveal that even state-of-the-art models like GPT-4.1 achieve only a 47.61% success rate on this benchmark, substantially below human performance, thereby systematically exposing the capability gap of general-purpose agents in enterprise scenarios for the first time.
📝 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled agents to operate in open-ended web and operating system environments. However, existing benchmarks predominantly target consumer-oriented scenarios (e.g., e-commerce and travel booking), failing to capture the complexity and rigor of professional enterprise workflows. Enterprise systems pose distinct challenges, including high-density user interfaces, strict business logic constraints, and a strong reliance on precise, state-consistent information retrieval, settings in which current generalist agents often struggle. To address this gap, we introduce EntWorld, a large-scale benchmark consisting of 1,756 tasks across six representative enterprise domains, including customer relationship management (CRM), information technology infrastructure library (ITIL), and enterprise resource planning (ERP) systems. Unlike previous datasets that depend on fragile execution traces or extensive manual annotation, EntWorld adopts a schema-grounded task generation framework that directly reverse-engineers business logic from underlying database schemas, enabling the synthesis of realistic, long-horizon workflows. Moreover, we propose an SQL-based deterministic verification mechanism that replaces ambiguous visual matching with rigorous state-transition validation. Experimental results demonstrate that state-of-the-art models (e.g., GPT-4.1) achieve only a 47.61% success rate on EntWorld, substantially below human performance, highlighting a pronounced enterprise gap in current agentic capabilities and the necessity of developing domain-specific agents. We release EntWorld as a rigorous testbed to facilitate the development and evaluation of the next generation of enterprise-ready digital agents.