🤖 AI Summary
In offline multi-agent reinforcement learning, factorized behavior policies induce severe joint-action distribution shift, substantially degrading out-of-distribution (OOD) policy evaluation. This work theoretically establishes that such shift grows linearly with the number of deviating agents rather than exponentially with the joint-action space. To exploit this, the authors propose Soft-Partial Conservative Q-Learning (SPaCQL): it alleviates distribution shift via a partial action replacement mechanism that updates the actions of only a subset of agents while fixing the rest to the behavioral data, and dynamically weights the distinct replacement strategies using uncertainty estimates to yield a tighter value error bound. SPaCQL integrates conservative Q-learning, epistemic uncertainty modeling, and adaptive weighting under rigorous theoretical guarantees. Empirical results demonstrate that SPaCQL significantly outperforms existing baselines on offline datasets with independence structure, achieving superior policy performance and improved training stability.
📝 Abstract
Offline multi-agent reinforcement learning (MARL) is severely hampered by the challenge of evaluating out-of-distribution (OOD) joint actions. Our core finding is that when the behavior policy is factorized (a common scenario where agents act fully or partially independently during data collection), a strategy of partial action replacement (PAR) can significantly mitigate this challenge. PAR updates the actions of one or several agents while the others remain fixed to the behavioral data, reducing distribution shift compared to full joint-action updates. Based on this insight, we develop Soft-Partial Conservative Q-Learning (SPaCQL), which uses PAR to mitigate the OOD issue and dynamically weights different PAR strategies based on the uncertainty of value estimation. We provide a rigorous theoretical foundation for this approach, proving that under factorized behavior policies, the induced distribution shift scales linearly with the number of deviating agents rather than exponentially with the joint-action space. This yields a provably tighter value error bound for this important class of offline MARL problems. Our theoretical results also indicate that SPaCQL adaptively addresses distribution shift using uncertainty-informed weights. Our empirical results demonstrate that SPaCQL enables more effective policy learning and show its clear superiority over baseline algorithms when the offline dataset exhibits an independence structure.
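To make the two core mechanisms concrete, the sketch below illustrates (1) partial action replacement, where only a chosen subset of agents deviates from the logged joint action, and (2) uncertainty-informed weighting of the resulting PAR strategies. This is a minimal toy illustration, not the paper's implementation: the agent subsets, the Q-ensemble, and the inverse-standard-deviation weighting rule are all hypothetical choices made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def partial_action_replacement(behavioral_action, policy_action, replace_idx):
    """Replace the actions of agents in `replace_idx` with the current
    policy's actions; all other agents keep their behavioral actions."""
    new_action = behavioral_action.copy()
    new_action[replace_idx] = policy_action[replace_idx]
    return new_action

def uncertainty_weights(q_ensemble_values):
    """Weight each PAR strategy inversely to its epistemic uncertainty,
    estimated as the std across a Q-ensemble, normalized to sum to 1.
    (The exact weighting rule here is an illustrative assumption.)"""
    stds = q_ensemble_values.std(axis=0)          # (num_strategies,)
    inv = 1.0 / (stds + 1e-8)
    return inv / inv.sum()

# Toy setup: 3 agents with discrete actions logged in the dataset.
behavioral_action = np.array([0, 1, 2])   # joint action from the dataset
policy_action     = np.array([2, 1, 0])   # current policy's joint action

# Two hypothetical PAR strategies: deviate agent {0}, or agents {0, 2}.
strategies = [np.array([0]), np.array([0, 2])]
candidates = [partial_action_replacement(behavioral_action, policy_action, s)
              for s in strategies]
# candidates[0] == [2, 1, 2]; candidates[1] == [2, 1, 0]

# Fake Q-ensemble evaluations, shape (ensemble_size, num_strategies);
# a real method would query learned Q-networks here.
q_vals = rng.normal(loc=[1.0, 1.2], scale=[0.1, 0.5],
                    size=(5, len(strategies)))
weights = uncertainty_weights(q_vals)     # sums to 1
soft_target = float(np.dot(weights, q_vals.mean(axis=0)))
```

Note how each candidate joint action differs from the behavioral one in only the replaced coordinates, which is what keeps the induced distribution shift growing with the number of deviating agents rather than with the size of the joint-action space.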