Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a failure mode of model-based reinforcement learning: policies trained in simulation exploit small model prediction errors and then fail in the real world. The authors propose shifting the model-learning objective from predictive accuracy to policy robustness, formulated as a zero-sum minimax game between the dynamics model and an adversarial policy. Building on an online learning guarantee, a critic-based simplification, and an Error-MDP duality, they derive a provably convergent active data selection algorithm. On continuous control tasks, the method reduces prediction error in strategically important regions by 1.5–2.2× and lets policies trained purely in simulation reach near-optimal performance when deployed in the real environment.
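
One generic way to write a game of this kind is sketched here; it is not necessarily the paper's exact objective. The notation is assumed for illustration: $M$ is the learned dynamics model, $M^\star$ the real environment, $\pi$ the adversarial policy, and $V^{\pi}_{M}$ the value of $\pi$ under dynamics $M$.

$$
\min_{M \in \mathcal{M}} \; \max_{\pi \in \Pi} \; \left| V^{\pi}_{M^\star} - V^{\pi}_{M} \right|
$$

Under this reading, the model player does well only if no policy can find a large gap between its simulated value and its real value, which is a stricter requirement than low average prediction error.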

📝 Abstract

Model-based reinforcement learning (MBRL) agents typically learn world models by minimizing predictive loss. However, powerful RL optimizers inevitably exploit minor model inaccuracies, leading to simulator exploitation and a reality gap where policies succeed in simulation but fail in the real world. We propose that the objective for learning simulators should be strategic robustness rather than predictive accuracy, and formulate this as a zero-sum minimax game between a model player and an adversarial policy player. We provide a comprehensive theoretical analysis: (1) an online learning guarantee showing the game is learnable with sublinear regret bounds; (2) a tractable critic-based simplification bounding the global policy-value gap by the local critic's loss; and (3) an Error-MDP duality, proving that finding the worst-case policy is formally dual to a standard RL problem where the reward is the one-step critic error. This duality yields a provably convergent active data selection algorithm. Experiments on continuous control tasks demonstrate that our approach reduces prediction error in strategically important regions by $1.5$–$2.2\times$ and enables policies trained purely in simulation to achieve near-optimal real-world performance.
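
The Error-MDP idea in point (3) can be illustrated on a toy tabular problem. The sketch below is a minimal illustration under stated assumptions, not the authors' algorithm: the function names `one_step_critic_error` and `solve_error_mdp`, the two-state MDP, the fixed critic, and the use of the true dynamics to solve the Error-MDP are all hypothetical choices made for the example. It computes a one-step critic error for each state-action pair, treats that error as a reward, and runs ordinary value iteration to find the policy that accumulates the most error.

```python
def one_step_critic_error(p_true, p_model, critic):
    """Return |E_true[V(s')] - E_model[V(s')]| for every (state, action)."""
    n_states, n_actions = len(p_true), len(p_true[0])
    err = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            v_true = sum(p * v for p, v in zip(p_true[s][a], critic))
            v_model = sum(p * v for p, v in zip(p_model[s][a], critic))
            err[s][a] = abs(v_true - v_model)
    return err


def solve_error_mdp(p, reward, gamma=0.9, iters=200):
    """Value iteration on dynamics p with the given reward table.

    Returns the greedy policy (one action index per state) and the values.
    """
    n_states, n_actions = len(p), len(p[0])
    values = [0.0] * n_states
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        q = [
            [
                reward[s][a]
                + gamma * sum(pr * v for pr, v in zip(p[s][a], values))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        values = [max(row) for row in q]
    policy = [max(range(n_actions), key=lambda a: q[s][a]) for s in range(n_states)]
    return policy, values


# Toy example (assumed): p[s][a] is a distribution over next states.
# Action 0 moves to state 0, action 1 moves to state 1.
p_true = [[[1.0, 0.0], [0.0, 1.0]],
          [[1.0, 0.0], [0.0, 1.0]]]
# The learned model is wrong only for (state 1, action 1).
p_model = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 0.0], [0.5, 0.5]]]
critic = [0.0, 1.0]  # assumed fixed critic values V(s)

errors = one_step_critic_error(p_true, p_model, critic)
worst_policy, _ = solve_error_mdp(p_true, errors)
print(errors)        # error is nonzero only at (state 1, action 1)
print(worst_policy)  # the policy that steers toward the mis-modeled pair
```

In this toy setting the error-seeking policy heads for the single mis-modeled state-action pair and stays there, which is the kind of region an active data selection scheme would sample next. In a real setting the true dynamics are unknown, so the error would have to be estimated from collected transitions.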

Problem

Research questions and friction points this paper is trying to address.

simulator exploitation

reality gap

model-based reinforcement learning

predictive accuracy

policy robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

strategic robustness

minimax game

model-based reinforcement learning