🤖 AI Summary
This work addresses the challenge of learning collaborative policies in multi-agent systems under interface constraints, where each agent can access only its local observations and no learner observes joint trajectories. To tackle this, we propose IC-Q, an asynchronous decentralized Q-learning algorithm that models cross-agent workflows as an Interface-Constrained Semi-Markov Decision Process (IC-SMDP) and coordinates agents by exchanging a single scalar at each handoff. We provide what is, to our knowledge, the first finite-sample guarantee for neural Q-learning under decentralized partial observability, and we extend the Approximate Information State (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs with random option durations and Markovian noise. Empirical results show that IC-Q matches the performance of a centralized oracle on a controlled synthetic IC-SMDP and on three applications (multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming), with each of the three error components scaling along its corresponding axis as the bound predicts.
📝 Abstract
We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories. This is the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff consists of exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$, under the random option-duration discount, that decomposes into three independently controllable error sources: a neural function-approximation error, an interface representation gap, and a mixing-time residual. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random durations, neither of which has been done in prior work. To our knowledge, this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments (a controlled synthetic IC-SMDP that validates the bound term by term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming) show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.
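The one-scalar handoff can be illustrated with a minimal tabular sketch. This is not the paper's neural IC-$Q$ algorithm. It assumes that the scalar passed back at a handoff is the receiving agent's greedy value estimate for its own local observation, and that the sender applies a standard SMDP $Q$-learning update discounted by the option's duration. The names `LocalAgent`, `bootstrap_value`, and `update` are hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


class LocalAgent:
    """Hypothetical agent with a private tabular Q-function over local observations."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.alpha = alpha      # step size
        self.gamma = gamma      # per-primitive-step discount
        self.epsilon = epsilon  # exploration rate
        self.rng = random.Random(seed)
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.pending = None     # (obs, action) waiting for the handoff scalar

    def act(self, obs):
        """Epsilon-greedy choice using only this agent's local observation."""
        if self.rng.random() < self.epsilon:
            action = self.rng.randrange(self.n_actions)
        else:
            values = self.q[obs]
            action = max(range(self.n_actions), key=values.__getitem__)
        self.pending = (obs, action)
        return action

    def bootstrap_value(self, obs):
        """Assumed handoff message: one scalar, the receiver's greedy value."""
        return max(self.q[obs])

    def update(self, reward, duration, scalar):
        """SMDP Q-learning step: the target is discounted by gamma ** duration."""
        obs, action = self.pending
        target = reward + (self.gamma ** duration) * scalar
        self.q[obs][action] += self.alpha * (target - self.q[obs][action])
        self.pending = None


# One handoff: the sender acts, the receiver returns a single scalar,
# and the sender updates without seeing the receiver's observation or Q-table.
sender = LocalAgent(n_actions=2, alpha=0.5, gamma=0.5, epsilon=0.0)
receiver = LocalAgent(n_actions=2)
receiver.q["receiver_obs"] = [1.0, 3.0]

sender.act("sender_obs")
scalar = receiver.bootstrap_value("receiver_obs")
sender.update(reward=1.0, duration=2, scalar=scalar)
```

In this sketch the sender never reads the receiver's observation, actions, or table; the only cross-agent quantity is `scalar`, which mirrors the interface constraint described in the abstract. The accumulated `reward` and the `duration` are taken here to be locally known to the sender, which is a further assumption of the example.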