AI Summary
In non-stationary environments, rapid environmental dynamics quickly render historical experiences obsolete, while conventional TD-error-based prioritized replay cannot distinguish errors arising from policy updates from those induced by environmental shifts, limiting learning efficiency. To address this, we propose DEER (Discrepancy of Environment Prioritized Experience Replay), an adaptive experience replay framework for dynamic environments. First, we formalize the Discrepancy of Environment Dynamics (DoE) to quantify environmental change. Second, we design a classifier-based adaptive sampling mechanism that reweights experience priorities when an environmental switch is detected. Third, we integrate value-function discrepancy modeling with off-policy optimization to enable precise control over experience reuse. Evaluated on four standard non-stationary benchmarks, DEER achieves an average performance gain of 11.54% over the strongest baselines, significantly improving both sample efficiency and environmental adaptability.
Abstract
Reinforcement learning (RL) in non-stationary environments is challenging, as changing dynamics and rewards quickly make past experiences outdated. Traditional experience replay (ER) methods, especially those using TD-error prioritization, struggle to distinguish between changes caused by the agent's policy and those from the environment, resulting in inefficient learning under dynamic conditions. To address this challenge, we propose the Discrepancy of Environment Dynamics (DoE), a metric that isolates the effects of environment shifts on value functions. Building on this, we introduce Discrepancy of Environment Prioritized Experience Replay (DEER), an adaptive ER framework that prioritizes transitions based on both policy updates and environmental changes. DEER uses a binary classifier to detect environment changes and applies distinct prioritization strategies before and after each shift, enabling more sample-efficient learning. Experiments on four non-stationary benchmarks demonstrate that DEER further improves the performance of off-policy algorithms by 11.54 percent compared to the best-performing state-of-the-art ER methods.
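To make the prioritization idea concrete, below is a minimal illustrative sketch, not the authors' implementation. It assumes each stored transition carries a precomputed TD error and a hypothetical per-transition DoE estimate, and that an external binary classifier supplies a `shift_detected` flag; the mixing weight `doe_weight`, the exponent `alpha`, and all names are assumptions introduced for illustration only.

```python
# Illustrative sketch only -- not the paper's actual algorithm or code.
# Assumptions (hypothetical): transitions carry a TD error and a DoE estimate;
# `shift_detected` comes from an external binary change classifier;
# `doe_weight` and `alpha` are made-up hyperparameters.
import random
from dataclasses import dataclass, field


@dataclass
class Transition:
    state: list
    action: int
    reward: float
    next_state: list
    td_error: float   # magnitude of the TD error under the current policy
    doe: float        # hypothetical Discrepancy of Environment Dynamics estimate


@dataclass
class DoEPrioritizedBuffer:
    capacity: int = 100_000
    alpha: float = 0.6          # priority exponent, as in standard PER
    doe_weight: float = 0.5     # hypothetical mixing weight used after a shift
    storage: list = field(default_factory=list)

    def add(self, t: Transition) -> None:
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)  # drop the oldest transition
        self.storage.append(t)

    def _priority(self, t: Transition, shift_detected: bool) -> float:
        if shift_detected:
            # After a detected environment change, up-weight transitions whose
            # value estimates disagree most with the new dynamics (high DoE).
            score = (1 - self.doe_weight) * abs(t.td_error) + self.doe_weight * t.doe
        else:
            # Before a shift, fall back to conventional TD-error prioritization.
            score = abs(t.td_error)
        return (score + 1e-6) ** self.alpha

    def sample(self, batch_size: int, shift_detected: bool) -> list:
        priorities = [self._priority(t, shift_detected) for t in self.storage]
        total = sum(priorities)
        probs = [p / total for p in priorities]
        return random.choices(self.storage, weights=probs, k=batch_size)


if __name__ == "__main__":
    buf = DoEPrioritizedBuffer(capacity=1000)
    for _ in range(200):
        buf.add(Transition([0.0], 0, 0.0, [0.0],
                           td_error=random.random(), doe=random.random()))
    batch = buf.sample(32, shift_detected=True)
    print(len(batch), "transitions sampled with DoE-aware priorities")
```

The intended reading: before a detected shift the buffer behaves like ordinary prioritized replay, while after a shift the DoE term pushes sampling toward transitions whose values conflict most with the new dynamics, so stale experience is revisited and corrected first.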