EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

📅 2026-03-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the poor performance of current large language models (LLMs) on enterprise-scale, long-horizon tasks, which the authors attribute to the absence of evaluation benchmarks that reflect real-world business state dynamics and access-control constraints. To bridge this gap, they propose an agentic evaluation framework tailored to enterprise workflows, featuring a containerized sandbox environment with 164 database tables, 512 tools, and 1,150 expert-designed tasks that emphasize long-horizon planning, state persistence, and permission-aware execution. Experimental results reveal that even the best-performing model, Claude Opus 4.5, achieves only a 37.4% task success rate; providing human-generated plans improves performance by 14–35 percentage points. Furthermore, models fail to reject infeasible tasks over 46% of the time, exposing critical deficiencies in strategic reasoning and task-feasibility assessment.
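The benchmark's actual API is not shown in the summary; the sketch below is a hypothetical illustration (all class and function names are invented) of the two properties it stresses: tool calls mutate persistent state, and each call is checked against the caller's role before executing, which is the kind of constraint agents must respect or refuse.

```python
# Hypothetical sketch -- NOT the EnterpriseOps-Gym API. Illustrates a
# permission-aware tool registry over persistent sandbox state.

class ToolPermissionError(Exception):
    """Raised when an agent's role lacks access to a tool."""

class EnterpriseSandbox:
    def __init__(self):
        self.state = {"tickets": []}   # state persists across agent steps
        self.tools = {}                # tool name -> (required_role, fn)

    def register(self, name, required_role, fn):
        self.tools[name] = (required_role, fn)

    def call(self, agent_role, name, **kwargs):
        required_role, fn = self.tools[name]
        if agent_role != required_role:
            # Permission-aware execution: deny instead of silently acting.
            raise ToolPermissionError(f"{agent_role} may not call {name}")
        return fn(self.state, **kwargs)

def create_ticket(state, title):
    state["tickets"].append(title)
    return len(state["tickets"])       # ticket id = position in the log

sandbox = EnterpriseSandbox()
sandbox.register("create_ticket", "support_agent", create_ticket)

# Authorized call succeeds and mutates persistent state.
ticket_id = sandbox.call("support_agent", "create_ticket", title="VPN outage")

# Unauthorized role is rejected; a well-behaved agent should refuse
# such a task rather than attempt it and cause side effects.
try:
    sandbox.call("intern", "create_ticket", title="Delete prod DB")
except ToolPermissionError as e:
    denied = str(e)
```

Under this toy model, the "failure to refuse" metric corresponds to an agent issuing the second call at all instead of reporting the task as infeasible.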

📝 Abstract
Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in the enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments: specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing Claude Opus 4.5 achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14–35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (the best model refuses only 53.9% of them), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed to advance the robustness of agentic planning in professional workflows.
Problem

Research questions and friction points this paper is trying to address.

Enterprise AI
Agentic Planning
Stateful Environments
Tool Use
Benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

EnterpriseOps-Gym
stateful agentic planning
tool use
enterprise benchmark
long-horizon reasoning