🤖 AI Summary
Reinforcement learning (RL) policies often fail when deployed in real-world environments due to train-test distributional shift. Conventional approaches fix the robustness budget ε, yet this static choice inherently trades off nominal performance against robustness: overly small ε yields insufficient robustness, while excessively large ε induces over-conservatism or instability.
Method: We propose an adaptive robustness budget curriculum learning framework, modeling ε as a continuous, learnable curriculum variable. The uncertainty set is dynamically expanded during training to progressively increase robustness requirements. Our method integrates distributionally robust optimization, self-paced learning, and adversarial worst-case training.
Contribution/Results: Experiments across diverse tasks show an average 11.8% improvement in episode return—reaching 1.9× that of baseline algorithms. The approach significantly alleviates the robustness–performance trade-off and, for the first time, enables end-to-end curriculum scheduling of the robustness budget.
📝 Abstract
A central challenge in reinforcement learning is that policies trained in controlled environments often fail under distribution shifts at deployment into real-world environments. Distributionally Robust Reinforcement Learning (DRRL) addresses this by optimizing for worst-case performance within an uncertainty set defined by a robustness budget $epsilon$. However, fixing $epsilon$ results in a tradeoff between performance and robustness: small values yield high nominal performance but weak robustness, while large values can result in instability and overly conservative policies. We propose Distributionally Robust Self-Paced Curriculum Reinforcement Learning (DR-SPCRL), a method that overcomes this limitation by treating $epsilon$ as a continuous curriculum. DR-SPCRL adaptively schedules the robustness budget according to the agent's progress, enabling a balance between nominal and robust performance. Empirical results across multiple environments demonstrate that DR-SPCRL not only stabilizes training but also achieves a superior robustness-performance trade-off, yielding an average 11.8% increase in episodic return under varying perturbations compared to fixed or heuristic scheduling strategies, and achieving approximately 1.9$ imes$ the performance of the corresponding nominal RL algorithms.