AI Summary
In non-stationary environments, rapid environmental dynamics quickly render historical experiences obsolete, while conventional TD-error-based prioritized replay cannot distinguish errors arising from policy updates from those induced by environmental shifts, limiting learning efficiency. To address this, we propose DEER (Discrepancy of Environment Prioritized Experience Replay), an adaptive experience replay framework for dynamic environments. First, we formalize the Discrepancy of Environment Dynamics (DoE) to quantify environmental change. Second, we design a classifier-based adaptive sampling mechanism that reweights experience priorities when an environmental switch is detected. Third, we integrate value-function discrepancy modeling with off-policy optimization to enable precise control over experience reuse. Evaluated on four standard non-stationary benchmarks, DEER achieves an average performance gain of 11.54% over the strongest baselines, significantly improving both sample efficiency and environmental adaptability.
Abstract
Reinforcement learning (RL) in non-stationary environments is challenging, as changing dynamics and rewards quickly make past experiences outdated. Traditional experience replay (ER) methods, especially those using TD-error prioritization, struggle to distinguish between changes caused by the agent's policy and those from the environment, resulting in inefficient learning under dynamic conditions. To address this challenge, we propose the Discrepancy of Environment Dynamics (DoE), a metric that isolates the effects of environment shifts on value functions. Building on this, we introduce Discrepancy of Environment Prioritized Experience Replay (DEER), an adaptive ER framework that prioritizes transitions based on both policy updates and environmental changes. DEER uses a binary classifier to detect environment changes and applies distinct prioritization strategies before and after each shift, enabling more sample-efficient learning. Experiments on four non-stationary benchmarks demonstrate that DEER further improves the performance of off-policy algorithms by 11.54 percent compared to the best-performing state-of-the-art ER methods.
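To make the prioritization idea concrete, below is a minimal illustrative sketch, not the authors' implementation. It assumes each stored transition carries a precomputed TD error and a hypothetical per-transition DoE estimate, and that an external binary classifier supplies a `shift_detected` flag; the mixing weight `doe_weight`, the exponent `alpha`, and all names are assumptions introduced for illustration only.

```python
# Illustrative sketch only -- not the paper's actual algorithm or code.
# Assumptions (hypothetical): transitions carry a TD error and a DoE estimate;
# `shift_detected` comes from an external binary change classifier;
# `doe_weight` and `alpha` are made-up hyperparameters.
import random
from dataclasses import dataclass, field


@dataclass
class Transition:
    state: list
    action: int
    reward: float
    next_state: list
    td_error: float   # magnitude of the TD error under the current policy
    doe: float        # hypothetical Discrepancy of Environment Dynamics estimate


@dataclass
class DoEPrioritizedBuffer:
    capacity: int = 100_000
    alpha: float = 0.6          # priority exponent, as in standard PER
    doe_weight: float = 0.5     # hypothetical mixing weight used after a shift
    storage: list = field(default_factory=list)

    def add(self, t: Transition) -> None:
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)  # drop the oldest transition
        self.storage.append(t)

    def _priority(self, t: Transition, shift_detected: bool) -> float:
        if shift_detected:
            # After a detected environment change, up-weight transitions whose
            # value estimates disagree most with the new dynamics (high DoE).
            score = (1 - self.doe_weight) * abs(t.td_error) + self.doe_weight * t.doe
        else:
            # Before a shift, fall back to conventional TD-error prioritization.
            score = abs(t.td_error)
        return (score + 1e-6) ** self.alpha

    def sample(self, batch_size: int, shift_detected: bool) -> list:
        priorities = [self._priority(t, shift_detected) for t in self.storage]
        total = sum(priorities)
        probs = [p / total for p in priorities]
        return random.choices(self.storage, weights=probs, k=batch_size)


if __name__ == "__main__":
    buf = DoEPrioritizedBuffer(capacity=1000)
    for _ in range(200):
        buf.add(Transition([0.0], 0, 0.0, [0.0],
                           td_error=random.random(), doe=random.random()))
    batch = buf.sample(32, shift_detected=True)
    print(len(batch), "transitions sampled with DoE-aware priorities")
```

The intended reading: before a detected shift the buffer behaves like ordinary prioritized replay, while after a shift the DoE term pushes sampling toward transitions whose values conflict most with the new dynamics, so stale experience is revisited and corrected first.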