🤖 AI Summary
This work addresses statistical inference for transition kernels of finite controlled Markov chains in offline reinforcement learning when the behavior policy is unknown and possibly nonstationary or history-dependent. The authors propose a model-based bootstrap and establish distributional consistency of the bootstrap transition estimator under such policies, in both a single long-chain regime and the episodic regime. The analysis combines a bootstrap law of large numbers for visitation counts with a martingale central limit theorem for the bootstrap transition increments. By verifying Hadamard differentiability of the Bellman operators, the authors then apply the delta method to obtain asymptotically valid confidence intervals for value and $Q$-functions. In RiverSwim experiments, the proposed bootstrap intervals, especially the percentile intervals, outperform the episodic bootstrap and plug-in CLT intervals and are often close to nominal coverage ($50\%$, $90\%$, $95\%$), whereas the baselines are poorly calibrated at small sample sizes and short episode lengths.
📝 Abstract
We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. We extend bootstrap distributional consistency to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of the Bellman operators, yielding asymptotically valid confidence intervals for value and $Q$-functions. Experiments on the RiverSwim problem show that the proposed bootstrap confidence intervals (CIs), especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs, and are often close to nominal ($50\%$, $90\%$, $95\%$) coverage, while the baselines are poorly calibrated at small sample sizes and short episode lengths.
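To make the pipeline in the abstract concrete, the following is a minimal sketch of a model-based bootstrap with percentile CIs for a policy's value function. It is an illustration under stated assumptions, not the authors' procedure: bootstrap transition counts are redrawn from the estimated kernel conditional on the observed visitation counts (the paper's resampling scheme may differ), unvisited state-action pairs fall back to a uniform row, the reward table and the deterministic target policy are taken as known, and all function names (`estimate_kernel`, `policy_value`, `bootstrap_value_ci`) are hypothetical.

```python
import numpy as np


def estimate_kernel(transitions, n_states, n_actions):
    """Empirical transition kernel from (s, a, s_next) triples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=2)
    # Assumption: rows for unvisited (s, a) pairs default to uniform.
    p_hat = np.where(
        visits[..., None] > 0,
        counts / np.maximum(visits, 1)[..., None],
        1.0 / n_states,
    )
    return p_hat, visits.astype(int)


def policy_value(p, reward, policy, gamma):
    """Solve the Bellman evaluation equation V = r_pi + gamma * P_pi V."""
    n_states = p.shape[0]
    idx = np.arange(n_states)
    p_pi = p[idx, policy]       # (S, S) kernel under the target policy
    r_pi = reward[idx, policy]  # (S,) reward under the target policy
    return np.linalg.solve(np.eye(n_states) - gamma * p_pi, r_pi)


def bootstrap_value_ci(transitions, reward, policy, gamma,
                       n_states, n_actions, n_boot=500, level=0.90, seed=0):
    """Percentile CIs for V(s), resampling transitions from the fitted model."""
    rng = np.random.default_rng(seed)
    p_hat, visits = estimate_kernel(transitions, n_states, n_actions)
    values = np.empty((n_boot, n_states))
    for b in range(n_boot):
        p_star = p_hat.copy()
        for s in range(n_states):
            for a in range(n_actions):
                n = visits[s, a]
                if n > 0:
                    # Redraw next-state counts from the estimated kernel row.
                    p_star[s, a] = rng.multinomial(n, p_hat[s, a]) / n
        values[b] = policy_value(p_star, reward, policy, gamma)
    alpha = 1.0 - level
    lower = np.quantile(values, alpha / 2, axis=0)
    upper = np.quantile(values, 1 - alpha / 2, axis=0)
    return policy_value(p_hat, reward, policy, gamma), lower, upper
```

Calling `bootstrap_value_ci(data, reward, policy, 0.9, n_states, n_actions)` on a list of logged `(s, a, s_next)` triples returns the plug-in value estimate together with per-state lower and upper percentile bounds. Because the sketch only uses the pooled transition counts, it never needs the behavior policy, which is the feature of model-based resampling that the abstract emphasizes.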