World of Workflows: A Benchmark for Bringing World Models to Enterprise Systems

📅 2026-01-29
🤖 AI Summary
State-of-the-art large language models struggle to model hidden enterprise workflows and their cascading side effects, and under limited observability they often violate implicit constraints. To study this gap, we construct World of Workflows (WoW), a realistic ServiceNow-based enterprise environment with more than 4,000 business rules and 55 active workflows, and introduce WoW-bench, a benchmark of 234 tasks that evaluates constrained agentic task completion and enterprise dynamics modeling. Our study identifies “dynamics blindness” in frontier LLMs: they consistently fail to predict the invisible side effects of their actions, which leads to silent constraint violations. We argue that reliability in opaque systems requires grounded world modeling, in which agents simulate hidden state transitions when high-fidelity feedback is unavailable, and that reliable enterprise agents should explicitly learn system dynamics.

📝 Abstract
Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion similar to general consumer benchmarks, ignoring true challenges in enterprises, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. We release our GitHub repository for setting up and evaluating WoW.
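
To make "cascading side effects" concrete, here is a toy sketch of how one visible write can trigger hidden rules that change other records. It is not the WoW environment or any ServiceNow API: the table names, the two rules, and the helpers `execute` and `changed_fields` are all hypothetical and exist only for illustration.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

# Toy database state: table name -> column -> value (hypothetical schema).
State = Dict[str, Dict[str, str]]


@dataclass
class BusinessRule:
    name: str
    condition: Callable[[State], bool]
    effect: Callable[[State], None]


def _escalate(state: State) -> None:
    state["incident"]["assignment_group"] = "major_incident"


def _hold_change(state: State) -> None:
    state["change_request"]["state"] = "on_hold"


# Two hypothetical hidden rules; the second is triggered by the first.
RULES = [
    BusinessRule("escalate_critical_incident",
                 lambda s: s["incident"]["priority"] == "1",
                 _escalate),
    BusinessRule("hold_change_on_major_incident",
                 lambda s: s["incident"]["assignment_group"] == "major_incident",
                 _hold_change),
]


def execute(state: State, table: str, column: str, value: str,
            rules: List[BusinessRule]) -> Tuple[State, List[str]]:
    """Apply one write, then fire each rule at most once until nothing new fires."""
    after = copy.deepcopy(state)
    after[table][column] = value
    fired: List[str] = []
    progress = True
    while progress:
        progress = False
        for rule in rules:
            if rule.name not in fired and rule.condition(after):
                rule.effect(after)
                fired.append(rule.name)
                progress = True
    return after, fired


def changed_fields(before: State, after: State) -> Set[Tuple[str, str]]:
    """Return the (table, column) pairs whose values differ between two states."""
    return {(t, c) for t, row in after.items()
            for c, v in row.items() if before[t][c] != v}


initial: State = {
    "incident": {"priority": "3", "assignment_group": "service_desk"},
    "change_request": {"state": "scheduled"},
}
after, fired = execute(initial, "incident", "priority", "1", RULES)

# The agent only observes the field it wrote; everything else is a hidden effect.
visible = {("incident", "priority")}
hidden = changed_fields(initial, after) - visible
print(fired)
print(sorted(hidden))
```

In this toy setting, an agent that assumes only `incident.priority` changed would miss that a change request was put on hold. That is the kind of silent constraint violation the abstract attributes to dynamics blindness. A world model in the abstract's sense would need to predict the `hidden` set before acting.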

Problem

Research questions and friction points this paper is trying to address.

enterprise systems
hidden workflows
cascading side effects
limited observability
world models

Innovation

Methods, ideas, or system contributions that make the work stand out.

world models
enterprise workflows
LLM agents
observability gap
cascading side effects