AI Summary
Current evaluation frameworks lack standardized metrics for assessing agents' ability to spontaneously delegate tasks over extended time horizons. This work introduces DecisionBench, the first benchmarking framework tailored to emergent delegation. It integrates a fixed task suite (GAIA, tau-bench, BFCL multi-turn), a peer pool of 11 large language models, a deterministic skill-annotation layer, and a delegation interface that supports on-demand model invocation and peer profiling. The framework also provides multi-axis evaluation metrics and a counterfactual delegation ceiling. Experiments across 23,375 task instances show that current approaches reach routing fidelity-at-1 rates of only 7.5% to 29.5%, while the counterfactual analysis places perfect delegation 15 to 31 percentage points above measured performance, indicating substantial room for improvement.
Abstract
We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
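To make two of the metrics named above concrete, here is a minimal Python sketch of routing fidelity-at-k and the counterfactual-delegation ceiling. The abstract does not give formal definitions, so the ones used here are assumptions: fidelity-at-k is taken to be the share of delegated tasks whose chosen peer is among the k best-scoring peers for that task, and the ceiling gap is taken to be the difference between the mean score of an oracle that always picks the best peer and the measured mean score. The function names, the data layout, and the toy values are hypothetical and are not taken from the DecisionBench release.

```python
from typing import Dict, List, Optional


def fidelity_at_k(chosen: List[Optional[str]],
                  peer_scores: List[Dict[str, float]],
                  k: int = 1) -> float:
    """Share of delegated tasks whose chosen peer is in the top-k peers.

    Assumed definition; ties are broken by dictionary insertion order.
    """
    hits = 0
    delegated = 0
    for peer, scores in zip(chosen, peer_scores):
        if peer is None:  # the agent did not delegate on this task
            continue
        delegated += 1
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += peer in top_k
    return hits / delegated if delegated else 0.0


def counterfactual_headroom(measured: List[float],
                            peer_scores: List[Dict[str, float]]) -> float:
    """Oracle-minus-measured mean quality, in percentage points (assumed definition)."""
    oracle = sum(max(scores.values()) for scores in peer_scores) / len(peer_scores)
    actual = sum(measured) / len(measured)
    return 100.0 * (oracle - actual)


# Toy data (hypothetical): per-task quality each peer would achieve, in [0, 1].
peer_scores = [
    {"a": 1.0, "b": 0.0, "c": 0.0},
    {"a": 0.0, "b": 1.0, "c": 0.0},
    {"a": 0.0, "b": 0.0, "c": 1.0},
    {"a": 1.0, "b": 0.0, "c": 0.0},
]
chosen = ["a", "a", None, "b"]      # peer the agent delegated to, None = no delegation
measured = [1.0, 0.0, 0.0, 0.0]     # quality the agent actually achieved per task

f1 = fidelity_at_k(chosen, peer_scores, k=1)
gap = counterfactual_headroom(measured, peer_scores)
```

On this toy data only one of the three delegations goes to the best peer, and the oracle would solve every task while the agent solves one in four. That mirrors the shape of findings (ii) and (iii): routing choices and unrealized headroom are measured separately from mean end-task quality.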