🤖 AI Summary
This work proposes JobBench, the first benchmark for AI occupational agents centered on human intent and job augmentation rather than economic value or human replacement. Built on tasks that experts identify as high-priority for delegation, JobBench covers 35 professions and 130 real-world workplace tasks, each accompanied by heterogeneous reference materials and multidimensional scoring criteria. Outputs are scored with a fact-anchored chain of rubrics, averaging 35.6 binary judgment points per task. Evaluation across 36 models shows that even the strongest, Claude Opus 4.7 under Claude Code, achieves only 45.9 out of 100, underscoring a substantial gap between existing AI capabilities and the requirements of authentic professional workflows.
📝 Abstract
Current benchmarks for occupational AI agents are scoped primarily by economic value, which tells a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans according to their needs rather than replacing them on the basis of GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.
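To make the grading arithmetic concrete, the following is a minimal sketch of how binary rubric criteria could be aggregated into a percentage score. The abstract does not specify how the chain is evaluated, so everything here is an assumption for illustration: the `Criterion` type, the `depends_on` field, the rule that a criterion is credited only when its prerequisite was also credited, and the unweighted mean over tasks are all hypothetical and not taken from JobBench itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Criterion:
    """One binary rubric item (hypothetical structure, not JobBench's schema)."""
    name: str
    passed: bool                      # binary judgment from a grader
    depends_on: Optional[str] = None  # assumed link to an earlier criterion in the chain


def task_score(criteria: List[Criterion]) -> float:
    """Return the percentage of criteria credited for one task.

    Assumption: a criterion is credited only if it passed AND the
    criterion it depends on (if any) was itself credited.
    """
    if not criteria:
        return 0.0
    credited: Dict[str, bool] = {}
    for c in criteria:  # criteria are assumed to be listed in chain order
        prereq_ok = c.depends_on is None or credited.get(c.depends_on, False)
        credited[c.name] = c.passed and prereq_ok
    return 100.0 * sum(credited.values()) / len(criteria)


def benchmark_score(tasks: List[List[Criterion]]) -> float:
    """Unweighted mean of per-task percentages (an assumed aggregation)."""
    if not tasks:
        return 0.0
    return sum(task_score(t) for t in tasks) / len(tasks)


# Toy example: "d" passed on its own but depends on "c", which failed.
example_task = [
    Criterion("a", True),
    Criterion("b", True, depends_on="a"),
    Criterion("c", False),
    Criterion("d", True, depends_on="c"),
]
```

In this toy task two of four criteria are credited, so `task_score(example_task)` is 50.0. Under the same assumptions, a reported 45.9% would mean that slightly fewer than half of the binary criteria are credited on average.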