🤖 AI Summary
This work addresses non-stationary dynamics and rewards in real-world reinforcement learning—such as drifts, oscillations, and abrupt policy shifts—which often induce policy jitter and tracking error. Existing methods struggle to capture the local geometric structure of such non-stationarity. The authors model non-stationary discounted MDPs as differentiable homotopy paths and quantify the intrinsic complexity of environmental change through path length, curvature, and inflection points along the trajectory of optimal Bellman fixed points. Leveraging this geometric characterization, they adaptively modulate learning and planning intensity. For the first time, stability bounds based on path integrals and a gap-aware safety-feasibility region are established from a geometric perspective, enabling formal certification of stability in policy-switching regions. The proposed lightweight algorithms, HT-RL and HT-MCTS, significantly outperform static baselines in oscillatory and high-switching scenarios, effectively reducing dynamic regret and improving policy tracking.
📝 Abstract
Real-world reinforcement learning is often \emph{nonstationary}: rewards and dynamics drift, accelerate, oscillate, and trigger abrupt switches in the optimal action. Existing theory often represents nonstationarity with coarse-scale models that measure \emph{how much} the environment changes, not \emph{how} it changes locally -- even though acceleration and near-ties drive tracking error and policy chattering. We take a geometric view of nonstationary discounted Markov Decision Processes (MDPs) by modeling the environment as a differentiable homotopy path and tracking the induced motion of the optimal Bellman fixed point. This yields a length-curvature-kink signature of intrinsic complexity: cumulative drift, acceleration/oscillation, and action-gap-induced nonsmoothness. We prove a solver-agnostic path-integral stability bound and derive gap-safe feasible regions that certify local stability away from switch regimes. Building on these results, we introduce \textit{Homotopy-Tracking RL (HT-RL)} and \textit{HT-MCTS}, lightweight wrappers that estimate replay-based proxies of length, curvature, and near-tie proximity online and adapt learning or planning intensity accordingly. Experiments show improved tracking and dynamic regret over matched static baselines, with the largest gains in oscillatory and switch-prone regimes.
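The abstract does not spell out how the replay-based proxies are computed, but the three quantities it names (cumulative drift/length, acceleration/curvature, and near-tie proximity via the action gap) have natural finite-difference estimates from a history of value snapshots. The sketch below is a hypothetical illustration, not the authors' HT-RL implementation: `path_proxies` and `adapt_lr` are made-up names, and the specific modulation rule is an assumed heuristic consistent with "adapt learning or planning intensity."

```python
import numpy as np

def path_proxies(q_hist):
    """Geometric proxies from a time series of tabular Q snapshots, shape (T, S, A).

    length    ~ cumulative drift: sum of ||Q_t - Q_{t-1}|| (first differences)
    curvature ~ acceleration: sum of ||Q_t - 2 Q_{t-1} + Q_{t-2}|| (second differences)
    gap       ~ near-tie proximity: smallest top-2 action gap in the latest snapshot
    """
    q = np.asarray(q_hist, dtype=float)
    d1 = np.diff(q, axis=0)                           # first differences along time
    length = np.linalg.norm(d1.reshape(len(d1), -1), axis=1).sum()
    d2 = np.diff(q, n=2, axis=0)                      # discrete second differences
    curvature = (np.linalg.norm(d2.reshape(len(d2), -1), axis=1).sum()
                 if len(d2) else 0.0)
    top2 = np.sort(q[-1], axis=1)[:, -2:]             # two largest action values per state
    gap = float((top2[:, 1] - top2[:, 0]).min())      # worst-case action gap
    return length, curvature, gap

def adapt_lr(base_lr, length, curvature, gap, eps=1e-3):
    """Assumed heuristic: more drift/curvature or a smaller action gap
    (i.e. a near-tie, switch-prone regime) raises the tracking step size."""
    boost = 1.0 + length + curvature
    tie_factor = min(1.0 / max(gap, eps), 10.0)       # capped near-tie amplification
    return min(1.0, base_lr * boost * tie_factor)
```

In this illustrative scheme the proxies would be recomputed periodically from recent replay snapshots, and the modulated step size (or, for HT-MCTS, a simulation budget) applied to the next round of updates.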