🤖 AI Summary
Offline multi-agent reinforcement learning suffers from data sparsity and out-of-distribution actions due to the exponential growth of the joint action space. This work proposes the PLCQL framework, which introduces, for the first time, a state-adaptive Partial Action Replacement (PAR) strategy: it formulates the selection of the PAR subset as a contextual bandit problem and employs Proximal Policy Optimization to dynamically determine the number of agents to replace at each step, thereby balancing policy improvement with conservative value estimation. This approach reduces the number of Q-function evaluations per iteration from n to 1 and provides a theoretical error bound that scales linearly with the expected number of deviating agents. Evaluated on MPE, MaMuJoCo, and SMAC benchmarks, PLCQL achieves the highest normalized score on 66% of tasks, outperforms SPaCQL on 84% of tasks, and significantly lowers computational overhead.
📝 Abstract
Offline multi-agent reinforcement learning (MARL) faces a critical challenge: the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out-of-distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this by anchoring a subset of agents to dataset actions, but the existing approach relies on enumerating multiple subset configurations at high computational cost and cannot adapt to varying states. We introduce PLCQL, a framework that formulates PAR subset selection as a contextual bandit problem and learns a state-dependent PAR policy using Proximal Policy Optimisation with an uncertainty-weighted reward. This adaptive policy dynamically determines how many agents to replace at each update step, balancing policy improvement against conservative value estimation. We prove a value-error bound showing that the estimation error scales linearly with the expected number of deviating agents. Compared with the previous PAR-based method SPaCQL, PLCQL reduces the number of per-iteration Q-function evaluations from n to 1, significantly improving computational efficiency. Empirically, PLCQL achieves the highest normalised scores on 66% of tasks across MPE, MaMuJoCo, and SMAC benchmarks, outperforming SPaCQL on 84% of tasks while substantially reducing computational cost.
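To make the core PAR operation concrete, the following is a minimal sketch of partial action replacement: given a joint action proposed by the learned policies and the corresponding logged joint action from the offline dataset, the actions of k agents are swapped back to their in-distribution dataset values. The function name `partial_action_replacement`, the random choice of which agents to replace, and the toy action values are illustrative assumptions, not the paper's exact procedure; in PLCQL the replacement count would be chosen by the learned state-dependent bandit policy rather than fixed.

```python
import numpy as np

def partial_action_replacement(policy_actions, dataset_actions, k, rng):
    """Sketch of PAR (hypothetical helper, not the paper's code):
    replace the actions of k randomly chosen agents with their logged
    dataset counterparts; the remaining agents keep policy actions."""
    n = len(policy_actions)
    mixed = np.array(policy_actions, dtype=float, copy=True)
    replace_idx = rng.choice(n, size=k, replace=False)  # which agents to anchor
    mixed[replace_idx] = np.asarray(dataset_actions, dtype=float)[replace_idx]
    return mixed, set(replace_idx.tolist())

# Toy example: 4 agents, 1-D continuous actions.
rng = np.random.default_rng(0)
policy_actions = [0.9, -0.2, 0.5, 0.1]   # proposed by the learned policies
dataset_actions = [0.3, -0.1, 0.4, 0.0]  # logged in the offline dataset
mixed, replaced = partial_action_replacement(policy_actions, dataset_actions, k=2, rng=rng)
```

Evaluating the Q-function on `mixed` rather than on the fully policy-generated joint action is what keeps part of the joint action in-distribution; with a state-adaptive choice of `k`, a single such evaluation per iteration suffices, in contrast to enumerating subset configurations.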