Adaptive Estimation and Optimal Control in Offline Contextual MDPs without Stationarity

📅 2026-05-05
📈 Citations: 0
Influential: 0
📄 PDF

career value

192K/year
🤖 AI Summary
This work addresses the challenges of robust estimation and optimal control in nonstationary offline contextual Markov decision processes (MDPs), where model irregularities hinder reliable inference. To tackle this, the paper proposes an adaptive estimation and policy optimization framework grounded in T-estimation (Baraud, 2011). Under fully general assumptions, it establishes the first adaptive density estimator for nonstationary offline contextual MDPs with strong optimality guarantees, deriving oracle risk bounds under two distinct loss functions. The learned policy further achieves theoretically optimal estimation accuracy and cost-performance guarantees in finite samples, thereby attaining joint optimality in both estimation and control.
📝 Abstract
Contextual MDPs are powerful tools with wide applicability in areas from biostatistics to machine learning. However, specializing them to offline datasets has been challenging due to a lack of robust, theoretically backed methods. Our work tackles this problem by introducing a new approach towards adaptive estimation and cost optimization of contextual MDPs. This estimator, to the best of our knowledge, is the first of its kind, and is endowed with strong optimality guarantees. We achieve this by overcoming the key technical challenges evolving from the endogenous properties of contextual MDPs; such as non-stationarity, or model irregularity. Our guarantees are established under complete generality by utilizing the relatively recent and powerful statistical technique of $T$-estimation (Baraud, 2011). We first provide a procedure for selecting an estimator given a sample from a contextual MDP and use it to derive oracle risk bounds under two distinct, but nevertheless meaningful, loss functions. We then consider the problem of determining the optimal control with the aid of the aforementioned density estimate and provide finite sample guarantees for the cost function.
Problem

Research questions and friction points this paper is trying to address.

Contextual MDPs
offline learning
non-stationarity
adaptive estimation
optimal control
Innovation

Methods, ideas, or system contributions that make the work stand out.

contextual MDPs
offline reinforcement learning
adaptive estimation
non-stationarity
T-estimation