DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluation frameworks lack standardized metrics for assessing whether agents spontaneously delegate tasks to peer models over long horizons. This work introduces DecisionBench, a benchmark substrate for emergent delegation that fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer pool of 11 large language models from 7 vendor families, a deterministic skill-annotation layer, and a delegation interface (call_model plus an optional read_profile channel). It also defines a multi-axis metric suite and a counterfactual delegation ceiling. A five-condition reference sweep over 23,375 task instances finds routing fidelity-at-1 of only 7.5%–29.5% across conditions at near-equal mean quality, while the counterfactual ceiling places perfect delegation 15–31 percentage points above measured performance on every suite, indicating substantial unrealized headroom.

📝 Abstract

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
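The abstract names routing fidelity-at-k and a counterfactual-delegation ceiling without defining them here. The sketch below shows one plausible reading, not the benchmark's actual implementation. It assumes fidelity-at-k is the share of delegated tasks whose chosen peer ranks among the k best-scoring peers for that task, and that the ceiling is the mean quality obtained if every task went to its best peer. The function names, the `peer_scores` layout, and the toy data are all hypothetical.

```python
from statistics import mean

def fidelity_at_k(chosen, peer_scores, k=1):
    """Share of delegated tasks whose chosen peer is among the k best peers.

    Assumed definition (not taken from the paper): peers are ranked per task
    by their standalone quality score, and tasks without delegation are skipped.
    """
    hits = 0
    total = 0
    for task_id, peer in chosen.items():
        if peer is None:  # the orchestrator did not delegate this task
            continue
        scores = peer_scores[task_id]
        ranked = sorted(scores, key=scores.get, reverse=True)
        total += 1
        hits += peer in ranked[:k]
    return hits / total if total else 0.0

def counterfactual_headroom(measured, peer_scores):
    """Gap, in percentage points, between best-peer quality and measured quality."""
    ceiling = mean(max(peer_scores[t].values()) for t in measured)
    return 100.0 * (ceiling - mean(measured.values()))

# Hypothetical toy data: three tasks, three peers, quality scores in [0, 1].
peer_scores = {
    "t1": {"a": 0.9, "b": 0.4, "c": 0.1},
    "t2": {"a": 0.2, "b": 0.8, "c": 0.5},
    "t3": {"a": 0.3, "b": 0.6, "c": 1.0},
}
chosen = {"t1": "a", "t2": "c", "t3": None}    # peer picked per task, None = no delegation
measured = {"t1": 0.9, "t2": 0.5, "t3": 0.3}   # end-task quality actually achieved

f1 = fidelity_at_k(chosen, peer_scores, k=1)
f2 = fidelity_at_k(chosen, peer_scores, k=2)
headroom = counterfactual_headroom(measured, peer_scores)
```

Under these assumptions a system can hold mean quality steady while fidelity-at-1 moves substantially, which is the kind of gap between quality-only and routing-aware evaluation that the abstract describes.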

Problem

Research questions and friction points this paper is trying to address.

emergent delegation

long-horizon agentic workflows

task routing

multi-model orchestration

delegation benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

emergent delegation

agentic workflows

routing fidelity