🤖 AI Summary
This paper addresses the robustness of offline reinforcement learning (RL) in non-stationary environments: specifically, abrupt, time-varying, and potentially non-Markovian state offsets in real-world settings that induce partial observability and invalidate initially learned policies. To tackle this, the authors propose the first unified framework comprising three key components: (1) a conditional diffusion model that generates multi-hypothesis candidate states, mitigating state misidentification; (2) zero-shot time-series foundation models that forecast environmental dynamics, enabling non-Markovian disturbance modeling without prior assumptions about shift patterns; and (3) end-to-end integration of state inference, dynamics prediction, and offline policy learning. Evaluated on the first realistic time-series-augmented offline RL benchmark explicitly designed for non-stationarity, the method significantly outperforms state-of-the-art baselines, achieving robust policy performance from episode onset.
📝 Abstract
Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods typically assume stationarity or consider only synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrading performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance over competitive baselines. By integrating zero-shot forecasting with the agent's experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.
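The pipeline described above can be sketched at a very high level. This is a hypothetical toy illustration, not the paper's implementation: a naive persistence forecast stands in for the zero-shot time-series foundation model, Gaussian sampling around the offset-corrected observation stands in for the conditional diffusion model, and a simple mean stands in for candidate-state fusion. All function names (`forecast_offset`, `sample_candidate_states`, `fuse_candidates`) are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def forecast_offset(offset_history, horizon=1):
    # Hypothetical stand-in for the zero-shot time-series foundation model:
    # a naive persistence forecast that repeats the last observed offset.
    return np.repeat(offset_history[-1:], horizon, axis=0)

def sample_candidate_states(obs, predicted_offset, n_candidates=8, noise_scale=0.1):
    # Hypothetical stand-in for the conditional diffusion model: instead of
    # running a reverse-diffusion chain, draw multiple candidate true states
    # around the offset-corrected observation (multi-hypothesis generation).
    corrected = obs - predicted_offset
    return corrected + noise_scale * rng.standard_normal((n_candidates, obs.shape[-1]))

def fuse_candidates(candidates):
    # Aggregate the candidate states into one estimate; a mean is the
    # simplest choice and stands in for integration with the offline policy.
    return candidates.mean(axis=0)

# Toy episode: the observation is the true state plus a drifting offset.
offset_history = np.array([[0.0], [0.2], [0.4]])  # 1-D offset over 3 steps
true_state = np.array([1.0])
obs = true_state + offset_history[-1]

predicted = forecast_offset(offset_history)[0]
candidates = sample_candidate_states(obs, predicted)
state_estimate = fuse_candidates(candidates)  # would be fed to the policy
```

Under these toy assumptions, the offset-corrected candidates cluster near the true state even though the raw observation is shifted; the actual method replaces each stub with a learned or pretrained model.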