🤖 AI Summary
Existing agent benchmarks cover only the few domains where public environments exist, so they miss the diversity of real-world occupational work. This paper introduces OccuBench, the first benchmark for systematic cross-industry evaluation of AI agents on professional tasks, covering 100 real-world task scenarios across 10 industry categories and 65 specialized domains. Its core contribution is a multi-agent synthesis pipeline built on Language World Models (LWMs), which simulate domain-specific environments by generating tool responses with an LLM. The pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. The benchmark also includes controlled fault injection of three kinds: explicit errors, implicit data degradation, and mixed faults. An evaluation of 15 frontier models from 8 model families shows that each model has a distinct occupational capability profile and that implicit faults are the hardest of the three fault types. Performance improves consistently with model scale, newer generations, and higher reasoning effort; GPT-5.2, for example, gains 27.5 points from minimal to maximum reasoning effort. Strong agents, however, are not necessarily strong environment simulators.
📝 Abstract
AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance, with GPT-5.2 improving by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators, and simulator quality is critical for the reliability of LWM-based evaluation. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
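To make the three fault types concrete, the following is a minimal sketch of how fault injection around a simulated tool could look. It is not the OccuBench implementation, which the abstract does not detail: the names `with_faults`, `inject_explicit`, `inject_implicit`, and `lookup_shipment`, the response layout (`status`, `data`), and the `rate` parameter are all assumptions made for illustration. In the benchmark the tool response would come from an LWM; a fixed dictionary stands in for it here.

```python
import copy
import random
from typing import Any, Callable, Dict

Response = Dict[str, Any]
Tool = Callable[[Dict[str, Any]], Response]

MODES = ("none", "explicit", "implicit", "mixed")


def inject_explicit(response: Response, rng: random.Random) -> Response:
    """Replace the response with an overt error the agent can see."""
    return rng.choice([
        {"status": 500, "error": "Internal Server Error"},
        {"status": 504, "error": "Gateway Timeout"},
    ])


def inject_implicit(response: Response, rng: random.Random) -> Response:
    """Silently degrade the payload while keeping the success status."""
    degraded = copy.deepcopy(response)  # never mutate the clean response
    data = degraded.get("data")
    if isinstance(data, dict) and data:
        # Missing field: drop one key chosen at random.
        del data[rng.choice(sorted(data))]
        # Truncated data: keep only the first half of any remaining list.
        for key, value in list(data.items()):
            if isinstance(value, list) and len(value) > 1:
                data[key] = value[: len(value) // 2]
    return degraded


def with_faults(tool: Tool, mode: str, rate: float, seed: int = 0) -> Tool:
    """Wrap a tool so that a fraction `rate` of its calls is faulted."""
    if mode not in MODES:
        raise ValueError(f"unknown fault mode: {mode!r}")
    rng = random.Random(seed)  # seeded so that runs are reproducible

    def wrapped(args: Dict[str, Any]) -> Response:
        response = tool(args)
        if mode == "none" or rng.random() >= rate:
            return response
        if mode == "explicit":
            return inject_explicit(response, rng)
        if mode == "implicit":
            return inject_implicit(response, rng)
        # Mixed: pick one of the two fault kinds per call.
        injector = rng.choice([inject_explicit, inject_implicit])
        return injector(response, rng)

    return wrapped


def lookup_shipment(args: Dict[str, Any]) -> Response:
    """Hypothetical customs tool; stands in for an LWM-generated response."""
    return {
        "status": 200,
        "data": {
            "id": args["id"],
            "items": ["valves", "gaskets", "pumps", "filters"],
            "declared_value": 1200,
            "origin": "DE",
        },
    }
```

Calling `with_faults(lookup_shipment, "implicit", rate=1.0)({"id": "S1"})` returns a response that still reports status 200 but has one field missing, which illustrates why the abstract describes implicit faults as lacking overt error signals: the agent has to notice the degradation from the data itself rather than from an error code.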