Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of offline reinforcement learning in large or continuous action spaces and the lack of theoretical guarantees for explicitly parameterized policies. By extending mirror descent to parameterized policies and establishing a connection with natural policy gradient, the proposed approach resolves contextual coupling across states and incorporates a pessimism principle to ensure stability. The study provides the first theoretical guarantees for general parameterized policies in offline RL, yielding a computationally tractable optimization framework that naturally supports continuous action spaces. Furthermore, it reveals an intrinsic unification between offline reinforcement learning and imitation learning, offering new insights into their algorithmic and theoretical connections.
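The pessimism principle mentioned above can be illustrated with a minimal lower-confidence-bound sketch (the names `q_hat`, `uncertainty`, and `beta` are illustrative assumptions, not the paper's notation; the paper instantiates pessimism through its function-approximation setup rather than this tabular form):

```python
import numpy as np

def pessimistic_value(q_hat, uncertainty, beta=1.0):
    """Lower-confidence-bound value estimate: subtract a penalty
    proportional to how uncertain each action's value is under the
    offline data, so poorly covered actions are not over-selected."""
    return q_hat - beta * uncertainty

# toy example: action 1 looks slightly better but is poorly covered
q_hat = np.array([1.0, 1.2])
uncertainty = np.array([0.1, 0.8])
best = np.argmax(pessimistic_value(q_hat, uncertainty))
print(best)  # → 0, the well-covered action
```

Acting greedily on the penalized values steers the learned policy toward actions the dataset actually supports, which is what stabilizes offline training.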

📝 Abstract
We investigate the theoretical aspects of offline reinforcement learning (RL) under general function approximation. While prior works (e.g., Xie et al., 2021) have established the theoretical foundations of learning a good policy from offline data via pessimism, existing algorithms that are computationally tractable (often in an oracle-efficient sense), such as PSPI, only apply to finite and small action spaces. Moreover, these algorithms rely on state-wise mirror descent and require actors to be implicitly induced from the critic functions, failing to accommodate standalone policy parameterization which is ubiquitous in practice. In this work, we address these limitations and extend the theoretical guarantees to parameterized policy classes over large or continuous action spaces. When extending mirror descent to parameterized policies, we identify contextual coupling as the core difficulty, and show how connecting mirror descent to natural policy gradient leads to novel analyses, guarantees, and algorithmic insights, including a surprising unification between offline RL and imitation learning.
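To make the mirror descent connection concrete, here is a hedged sketch of the classic state-wise, KL-regularized mirror descent update, pi'(a) ∝ pi(a)·exp(eta·Q(s,a)), which coincides with a natural policy gradient step for softmax policies. This is the tabular baseline the paper moves beyond (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def mirror_descent_step(pi, q, eta):
    """One KL-regularized mirror descent update at a single state:
    pi'(a) proportional to pi(a) * exp(eta * q(a))."""
    logits = np.log(pi) + eta * q
    logits -= logits.max()            # shift for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()      # renormalize to a distribution

# toy example: 3 actions, uniform start
pi = np.ones(3) / 3
q = np.array([1.0, 0.0, -1.0])
for _ in range(5):
    pi = mirror_descent_step(pi, q, eta=0.5)
print(pi)  # probability mass concentrates on the highest-value action
```

Note this update is computed independently per state, which is exactly why it does not transfer directly to a shared parameterized policy: with standalone parameters, updating the policy at one state perturbs it at all others, the "contextual coupling" the abstract identifies.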
Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning
parametric policies
large action spaces
mirror descent
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

offline reinforcement learning
parameterized policies
mirror descent
natural policy gradient
imitation learning