Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of learning collaborative policies in multi-agent systems under interface constraints, where each agent sees only local observations and no learner has access to joint trajectories. The authors propose IC-Q, an asynchronous decentralized Q-learning algorithm that models cross-agent workflows as an Interface-Constrained Semi-Markov Decision Process (IC-SMDP) and coordinates agents by exchanging a single scalar at each handoff. The paper gives what the authors state is the first finite-sample guarantee for neural Q-learning under decentralized partial observability, and extends the Approximate Information State (AIS) framework to multi-agent SMDPs with random option durations and Markovian noise. Across a controlled synthetic IC-SMDP and three applied tasks (multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming), IC-Q matches the performance of a centralized oracle, and each of the three error components scales along its corresponding axis as the bound predicts.

📝 Abstract

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories. This is the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff consists of exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$, under the random option-duration discount, that decomposes into three independently controllable error sources: a neural function-approximation error, an interface representation gap, and a mixing-time residual. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge, this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments (a controlled synthetic IC-SMDP that validates the bound term by term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming) show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.
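
To make the single-scalar handoff concrete, the following is a minimal tabular sketch of the kind of update the abstract describes. It is not the paper's neural IC-$Q$ algorithm: the class `HandoffAgent`, its method names, the epsilon-greedy rule, and the choice of $\max_a Q(o, a)$ as the transmitted scalar are all illustrative assumptions. Each agent keeps a Q-table over its own local observations only. When control is handed off, the receiving agent returns one scalar, and the sending agent uses it as the bootstrap value, discounted by $\gamma^{\tau}$ where $\tau$ is the random duration of the option it just executed.

```python
import random
from collections import defaultdict


class HandoffAgent:
    """Hypothetical tabular stand-in for one agent in the workflow.

    The agent stores Q-values over its OWN local observations only and never
    sees another agent's observations, actions, or Q-table.
    """

    def __init__(self, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.alpha = alpha        # step size
        self.gamma = gamma        # per-primitive-step discount
        self.epsilon = epsilon    # exploration rate (assumed epsilon-greedy)
        self.rng = random.Random(seed)
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.pending = None       # (obs, action) waiting for its SMDP update

    def act(self, obs):
        """Pick an action for the local observation and remember the pair."""
        if self.rng.random() < self.epsilon:
            action = self.rng.randrange(self.n_actions)
        else:
            row = self.q[obs]
            action = max(range(self.n_actions), key=row.__getitem__)
        self.pending = (obs, action)
        return action

    def handoff_value(self, obs):
        """The one scalar sent back to the previous agent (assumed: max_a Q)."""
        return max(self.q[obs])

    def update(self, reward, duration, next_value):
        """SMDP Q-learning step with a duration-dependent discount.

        reward     : reward accumulated while this agent held control
                     (assumed to be already discounted within the option)
        duration   : number of primitive steps the option lasted (random)
        next_value : the scalar received from the agent that took over
        """
        obs, action = self.pending
        target = reward + (self.gamma ** duration) * next_value
        self.q[obs][action] += self.alpha * (target - self.q[obs][action])
        self.pending = None
```

In use, agent A calls `act` on its local observation, runs its option, and hands the artifact to agent B. B evaluates `handoff_value` on its own local observation and sends that number back, and A then calls `update` with its accumulated reward, the option duration, and the received scalar. No observation, action, or trajectory crosses the interface in this sketch, only the scalar.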

Problem

Research questions and friction points this paper is trying to address.

workflow learning

interface constraints

decentralized partial observability

multi-agent systems

handoff control

Innovation

Methods, ideas, or system contributions that make the work stand out.

interface-constrained SMDP

decentralized Q-learning

finite-sample guarantee