🤖 AI Summary
This work addresses non-stationary dynamics and rewards in real-world reinforcement learning—such as drifts, oscillations, and abrupt policy shifts—which often induce policy jitter and tracking error. Existing methods struggle to capture the local geometric structure of such non-stationarity. The authors model non-stationary discounted MDPs as differentiable homotopy paths and quantify the intrinsic complexity of environmental change through path length, curvature, and inflection points along the trajectory of optimal Bellman fixed points. Leveraging this geometric characterization, they adaptively modulate learning and planning intensity. For the first time, stability bounds based on path integrals and a gap-aware safety-feasibility region are established from a geometric perspective, enabling formal certification of stability in policy-switching regions. The proposed lightweight algorithms, HT-RL and HT-MCTS, significantly outperform static baselines in oscillatory and high-switching scenarios, effectively reducing dynamic regret and improving policy tracking.
📝 Abstract
Real-world reinforcement learning is often \emph{nonstationary}: rewards and dynamics drift, accelerate, oscillate, and trigger abrupt switches in the optimal action. Existing theory often represents nonstationarity with coarse-scale models that measure \emph{how much} the environment changes, not \emph{how} it changes locally -- even though acceleration and near-ties drive tracking error and policy chattering. We take a geometric view of nonstationary discounted Markov Decision Processes (MDPs) by modeling the environment as a differentiable homotopy path and tracking the induced motion of the optimal Bellman fixed point. This yields a length-curvature-kink signature of intrinsic complexity: cumulative drift, acceleration/oscillation, and action-gap-induced nonsmoothness. We prove a solver-agnostic path-integral stability bound and derive gap-safe feasible regions that certify local stability away from switch regimes. Building on these results, we introduce \textit{Homotopy-Tracking RL (HT-RL)} and \textit{HT-MCTS}, lightweight wrappers that estimate replay-based proxies of length, curvature, and near-tie proximity online and adapt learning or planning intensity accordingly. Experiments show improved tracking and dynamic regret over matched static baselines, with the largest gains in oscillatory and switch-prone regimes.
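The abstract does not spell out how the replay-based proxies are computed, but the three quantities it names (cumulative drift/length, acceleration/curvature, and near-tie proximity via the action gap) have natural finite-difference estimates from a history of value snapshots. The sketch below is a hypothetical illustration, not the authors' HT-RL implementation: `path_proxies` and `adapt_lr` are made-up names, and the specific modulation rule is an assumed heuristic consistent with "adapt learning or planning intensity."

```python
import numpy as np

def path_proxies(q_hist):
    """Geometric proxies from a time series of tabular Q snapshots, shape (T, S, A).

    length    ~ cumulative drift: sum of ||Q_t - Q_{t-1}|| (first differences)
    curvature ~ acceleration: sum of ||Q_t - 2 Q_{t-1} + Q_{t-2}|| (second differences)
    gap       ~ near-tie proximity: smallest top-2 action gap in the latest snapshot
    """
    q = np.asarray(q_hist, dtype=float)
    d1 = np.diff(q, axis=0)                           # first differences along time
    length = np.linalg.norm(d1.reshape(len(d1), -1), axis=1).sum()
    d2 = np.diff(q, n=2, axis=0)                      # discrete second differences
    curvature = (np.linalg.norm(d2.reshape(len(d2), -1), axis=1).sum()
                 if len(d2) else 0.0)
    top2 = np.sort(q[-1], axis=1)[:, -2:]             # two largest action values per state
    gap = float((top2[:, 1] - top2[:, 0]).min())      # worst-case action gap
    return length, curvature, gap

def adapt_lr(base_lr, length, curvature, gap, eps=1e-3):
    """Assumed heuristic: more drift/curvature or a smaller action gap
    (i.e. a near-tie, switch-prone regime) raises the tracking step size."""
    boost = 1.0 + length + curvature
    tie_factor = min(1.0 / max(gap, eps), 10.0)       # capped near-tie amplification
    return min(1.0, base_lr * boost * tie_factor)
```

In this illustrative scheme the proxies would be recomputed periodically from recent replay snapshots, and the modulated step size (or, for HT-MCTS, a simulation budget) applied to the next round of updates.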