Learning to Orchestrate Agents under Uncertainty

📅 2026-05-26
🤖 AI Summary
This work proposes an adaptive task delegation method for environments with dynamic and heterogeneous agents, whose reliability, cost, and output quality vary. The coordination problem is formulated as a regularized multi-armed bandit, where, for the first time, uncertainty in agents' output distributions is explicitly modeled at the coordination layer. A regularization term based on optimal transport (OT) distance is introduced to quantify alignment between an agent's output distribution and a task-specific reference distribution. This enables the method to distinguish between agents with identical mean rewards but differing distributional fidelity, thereby supporting decisions that jointly account for reliability, cost, and uncertainty. Theoretical analysis yields a regret bound of order √T, and experiments demonstrate that the approach significantly outperforms standard bandit algorithms and heuristic baselines in synthetic yet adversarial non-i.i.d. task allocation settings.
📝 Abstract
Adaptive orchestration of heterogeneous agents requires making sequential delegation decisions under uncertain and evolving agent behaviour, e.g., coordinating specialised AI models with varying reliability, cost, and response quality. While prior work on agent orchestration focuses on performance or cost, uncertainty in agent reliability and output distributions is typically not modelled explicitly at the orchestration level. In this work, we study the problem of adaptive orchestration of heterogeneous agents under uncertainty, where a meta-controller must decide when to delegate to an agent, accounting for reliability, cost, and uncertainty. We propose BOT-Orch, a lightweight framework that recasts orchestration as a bandit problem over agents, regularised by optimal transport (OT) distances between agent output distributions and task-specific reference distributions. We show that the regularised orchestration enjoys $\mathcal{O}(\sqrt{T})$ regret under standard assumptions, and provably induces a preference ordering among agents with identical mean rewards but differing distributional alignment. Empirically, we demonstrate that BOT-Orch outperforms standard bandit and heuristic baselines in synthetic but adversarial task allocation settings with heterogeneous, non-i.i.d. agent behaviour.
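The abstract does not spell out the exact selection rule, so the following is only a minimal sketch of the general idea, not BOT-Orch itself. It assumes a UCB-style index with a subtracted penalty, scalar agent outputs, and a 1-D Wasserstein-1 distance as the OT term. The names `wasserstein_1d`, `regularized_index`, and `lam`, as well as the toy data, are hypothetical. It illustrates how two agents with the same mean reward can still be ranked by how closely their outputs match a reference distribution.

```python
import math


def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1-D empirical samples.

    For equal-size samples on the real line, optimal transport pairs the
    sorted values, so W1 is the mean absolute gap between order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def regularized_index(rewards, outputs, reference, t, lam=1.0):
    """Hypothetical UCB-style index: mean + exploration bonus - lam * W1."""
    n = len(rewards)
    if n == 0:
        return math.inf  # force at least one pull of an untried agent
    mean = sum(rewards) / n
    bonus = math.sqrt(2.0 * math.log(t) / n)
    penalty = wasserstein_1d(outputs, reference)
    return mean + bonus - lam * penalty


# Toy data (made up for illustration): reward equals the scalar output.
reference = [0.4, 0.5, 0.6]
agent_a = [0.4, 0.5, 0.6]  # matches the reference distribution
agent_b = [0.0, 0.5, 1.0]  # same mean, much wider spread

score_a = regularized_index(agent_a, agent_a, reference, t=10)
score_b = regularized_index(agent_b, agent_b, reference, t=10)
chosen = "A" if score_a > score_b else "B"
```

Both agents have mean reward 0.5 and the same exploration bonus, so in this toy setup the ranking is decided entirely by the OT penalty. Setting `lam=0` recovers a plain UCB-style index under the same assumptions.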
Problem

Research questions and friction points this paper is trying to address.

agent orchestration
uncertainty
heterogeneous agents
adaptive delegation
reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

agent orchestration
uncertainty modeling
optimal transport
bandit algorithms
distributional alignment
Mary Chriselda Antony Oliver
Department of Applied Mathematics and Theoretical Physics, University of Cambridge
Lan Jiang
Centre for Human-Inspired Artificial Intelligence, University of Cambridge
Aaron Bundi Anampiu
African Institute for Mathematical Sciences, South Africa
Elaf Almahmoud
Centre for Human-Inspired Artificial Intelligence, University of Cambridge
Francesco Quinzan
University of Oxford
Umang Bhatt
University of Cambridge
Machine Learning
Artificial Intelligence
Human-AI Collaboration