To Switch or Not to Switch? Balanced Policy Switching in Offline Reinforcement Learning

📅 2024-07-01
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In offline reinforcement learning, policy switching must balance switching costs against long-term performance gains under the constraint of fixed historical data. This paper presents a systematic formalization of this trade-off, introducing a principled framework grounded in optimal transport theory and providing theoretical analysis of its properties. Building on this foundation, the authors propose Net Actor-Critic, a differentiable, end-to-end algorithm that jointly optimizes policy selection and switching decisions. Evaluated on Gymnasium-based robotic control tasks and SUMO-RL traffic signal control benchmarks, the method outperforms existing offline RL baselines: it reduces switching frequency while simultaneously improving policy performance. These results support the benefit of co-optimizing switching efficiency and policy quality, indicating that explicit modeling of switching dynamics improves both sample efficiency and deployment robustness in offline settings.

📝 Abstract
Reinforcement learning (RL) -- finding the optimal behaviour (also referred to as a policy) that maximizes the collected long-term cumulative reward -- is among the most influential approaches in machine learning, with a large number of successful applications. In several decision problems, however, one faces the possibility of policy switching -- changing from the current policy to a new one -- which incurs a non-negligible cost, and the decision must be made from historical data alone, without the possibility of further online interaction. Despite the evident importance of this offline learning scenario, to the best of our knowledge, very little effort has been made to tackle the key problem of balancing the gain and the cost of switching in a flexible and principled way. Leveraging ideas from the area of optimal transport, we initiate the systematic study of policy switching in offline RL. We establish fundamental properties and design a Net Actor-Critic algorithm for the proposed novel switching formulation. Numerical experiments demonstrate the efficiency of our approach on multiple robotic control benchmarks from Gymnasium and on traffic light control from SUMO-RL.
Problem

Research questions and friction points this paper is trying to address.

Balancing policy switching costs
Offline reinforcement learning challenges
Optimal transport in RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal transport for policy switching
Net Actor-Critic algorithm design
Offline reinforcement learning efficiency