World of Workflows: A Benchmark for Bringing World Models to Enterprise Systems

📅 2026-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty that state-of-the-art large language models have in modeling latent enterprise workflows and their cascading side effects: under limited observability, they often violate implicit constraints. To bridge this gap, we construct World of Workflows (WoW), a realistic enterprise simulation environment based on ServiceNow that encompasses over 4,000 business rules and 55 active workflows. We introduce WoW-bench, a benchmark of 234 tasks, to evaluate agents’ capabilities in constrained task completion and enterprise dynamics modeling. Our study reveals “dynamics blindness” in frontier models in enterprise settings, where models fail to predict the hidden side effects of their own actions, and advocates for grounded world models that explicitly learn latent system states. Experiments demonstrate that agents equipped with latent state simulation significantly improve both task success rates and compliance with implicit constraints, motivating a new paradigm for reliable enterprise AI agents.

📝 Abstract

Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion similar to general consumer benchmarks, ignoring the true challenges of enterprise settings, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. We release our GitHub repository for setting up and evaluating WoW.
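
To make the notions of hidden workflows and dynamics blindness concrete, here is a minimal toy sketch in Python. It is not the WoW environment and does not use any ServiceNow API; every table, field, rule, and function name in it is a hypothetical chosen for illustration. It shows a single write triggering a chain of hidden rules, while the acting agent receives only an acknowledgement of its own write.

```python
# Toy illustration only: NOT the WoW environment or any ServiceNow API.
# All table, field, rule, and function names here are hypothetical.
from copy import deepcopy


def rule_escalate(table, rec):
    """Hidden rule: a priority-1 incident spawns an escalation task."""
    if table == "incident" and rec.get("priority") == 1:
        task = {"id": f"task-{rec['id']}", "type": "escalation", "parent": rec["id"]}
        return [("task", task)]
    return []


def rule_notify(table, rec):
    """Hidden rule: every escalation task writes a notification row."""
    if table == "task" and rec.get("type") == "escalation":
        note = {"id": f"note-{rec['id']}", "about": rec["id"]}
        return [("notification", note)]
    return []


HIDDEN_RULES = [rule_escalate, rule_notify]


def apply_action(db, table, rec):
    """Write one record, then run hidden rules until no new writes appear.

    Returns only what the agent observes: an acknowledgement of its own write.
    """
    queue = [(table, rec)]
    while queue:
        t, r = queue.pop(0)
        db.setdefault(t, {})[r["id"]] = r
        for rule in HIDDEN_RULES:
            queue.extend(rule(t, r))
    return {"status": "ok", "table": table, "id": rec["id"]}


def state_diff(before, after):
    """Ground-truth diff over the full database state, which the agent never sees."""
    return sorted(
        (t, rid)
        for t, rows in after.items()
        for rid in rows
        if rid not in before.get(t, {})
    )


db = {}
before = deepcopy(db)
observation = apply_action(db, "incident", {"id": "INC1", "priority": 1})
changes = state_diff(before, db)

# A dynamics-blind prediction accounts only for the agent's own write.
naive_prediction = [("incident", "INC1")]
missed = [c for c in changes if c not in naive_prediction]

print(observation)  # what the agent sees
print(changes)      # what actually changed
print(missed)       # side effects the naive prediction failed to anticipate
```

In this sketch the agent's observation mentions one record, while the database gains three. An agent that models the hidden rules could predict the extra task and notification rows before acting; one that does not would miss them, which is the kind of silent side effect the abstract describes.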

Problem

Research questions and friction points this paper is trying to address.

enterprise systems

hidden workflows

cascading side effects

limited observability

world models

Innovation

Methods, ideas, or system contributions that make the work stand out.

world models

enterprise workflows

LLM agents