🤖 AI Summary
This work addresses the poor sample efficiency of Policy Space Response Oracles (PSRO) in multi-agent reinforcement learning, which stems from training best responses agent-by-agent and hinders scalability in large-scale or high-simulation-cost settings. To overcome this limitation, the authors propose the Joint Experience Best Response (JBR) framework, which collects trajectories once under the current meta-strategy profile and reuses them to compute best responses for all agents simultaneously, recasting best-response training as an offline reinforcement learning problem and thereby improving sample efficiency. Because the reused data can lie off-distribution for the policies being trained, three enhancements are further introduced: Conservative JBR, Exploration-Augmented JBR, and Hybrid BR, which mitigate distributional shift while preserving equilibrium robustness and substantially reducing sample consumption. Experiments demonstrate that Exploration-Augmented JBR achieves the best accuracy-efficiency trade-off across multiple benchmark environments, while Hybrid BR closely matches the performance of standard PSRO at a fraction of the sample cost.
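For concreteness, here is a minimal sketch of a PSRO loop with the JBR modification. The helpers `sample_joint_trajectories`, `offline_best_response`, `evaluate_restricted_game`, and `solve_meta_game` are hypothetical placeholders introduced for illustration, not the paper's implementation; the point is that each iteration collects one shared batch of data instead of a fresh batch per agent.

```python
# Hedged sketch of PSRO with Joint Experience Best Response (JBR).
# All helper functions are hypothetical placeholders, not the paper's API.

def psro_with_jbr(env, initial_policies, num_iterations, batch_size):
    populations = [[p] for p in initial_policies]   # one policy pool per agent
    payoffs = evaluate_restricted_game(env, populations)
    meta_strategy = solve_meta_game(payoffs)

    for _ in range(num_iterations):
        # 1. Collect ONE shared batch of trajectories under the current
        #    meta-strategy profile (standard PSRO would instead collect
        #    fresh data separately for each agent's BR training).
        dataset = sample_joint_trajectories(
            env, populations, meta_strategy, num_episodes=batch_size
        )

        # 2. Reuse the same joint dataset to train every agent's best
        #    response, turning each BR computation into an offline RL problem.
        for agent_id, pool in enumerate(populations):
            pool.append(offline_best_response(agent_id, dataset))

        # 3. Re-solve the expanded restricted game for a new meta-strategy.
        payoffs = evaluate_restricted_game(env, populations)
        meta_strategy = solve_meta_game(payoffs)

    return populations, meta_strategy
```

Amortizing step 1 across all agents is the source of the claimed sample savings: environment interaction grows with the batch size per iteration rather than with the number of agents.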
📝 Abstract
Multi-agent reinforcement learning (MARL) offers a scalable alternative to exact game-theoretic analysis but suffers from non-stationarity and the need to maintain diverse populations of strategies that capture non-transitive interactions. Policy Space Response Oracles (PSRO) address these issues by iteratively expanding a restricted game with approximate best responses (BRs), yet per-agent BR training makes the method prohibitively costly in many-agent or simulator-expensive settings. We introduce Joint Experience Best Response (JBR), a drop-in modification to PSRO that collects trajectories once under the current meta-strategy profile and reuses this joint dataset to compute BRs for all agents simultaneously. This amortizes environment interaction and improves the sample efficiency of best-response computation. Because JBR converts BR computation into an offline RL problem, the shared dataset may not cover the states an improving policy would visit; we therefore propose three remedies for the resulting distribution-shift bias: (i) Conservative JBR with safe policy improvement, (ii) Exploration-Augmented JBR, which perturbs data collection and admits theoretical guarantees, and (iii) Hybrid BR, which interleaves JBR with periodic independent BR updates. Across benchmark multi-agent environments, Exploration-Augmented JBR achieves the best accuracy-efficiency trade-off, while Hybrid BR attains near-PSRO performance at a fraction of the sample cost. Overall, JBR makes PSRO substantially more practical for large-scale strategic learning while preserving equilibrium robustness.
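As an illustration of the data-collection perturbation behind Exploration-Augmented JBR, one simple scheme is to mix each agent's meta-strategy with a uniform distribution before sampling opponents. The ε-uniform mixing and the value of `epsilon` below are assumptions made for this sketch, not necessarily the paper's exact mechanism.

```python
import numpy as np

def explore_augmented_profile(meta_strategy, epsilon=0.1):
    """Perturb a meta-strategy profile for data collection.

    meta_strategy: list of per-agent probability vectors over policy pools.
    Returns per-agent sampling distributions (1 - eps) * sigma + eps * uniform.
    The epsilon-uniform mixing is an illustrative assumption.
    """
    augmented = []
    for sigma in meta_strategy:
        sigma = np.asarray(sigma, dtype=float)
        uniform = np.full_like(sigma, 1.0 / len(sigma))
        augmented.append((1.0 - epsilon) * sigma + epsilon * uniform)
    return augmented
```

Sampling trajectories under the augmented profile broadens the coverage of the shared dataset, which is precisely what the offline BR step needs to reduce distribution-shift bias.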