🤖 AI Summary
In finite-horizon offline reinforcement learning, policy evaluation suffers from statistical bias because future-policy dependence is entangled with distributional shift. Method: We propose a selective uncertainty propagation mechanism that adaptively quantifies the difficulty of the distributional shift at each time step, decoupling the per-step evaluation problem from the full distribution evolution of dynamic programming. By combining treatment effect estimation from causal inference with uncertainty quantification, the method constructs tighter confidence intervals for value estimates. Contribution/Results: The approach supports both policy evaluation and policy learning in a purely offline setting, without online interaction. Experiments on toy environments show improved evaluation robustness and statistical efficiency relative to baselines that propagate uncertainty uniformly.
📝 Abstract
We consider the finite-horizon offline reinforcement learning (RL) setting, and are motivated by the challenge of learning the policy at any step h in dynamic programming (DP) algorithms. To learn this, it is sufficient to evaluate the treatment effect of deviating from the behavioral policy at step h after having optimized the policy for all future steps. Since the policy at any step can affect next-state distributions, the related distributional shift challenges can make this problem far more statistically hard than estimating such treatment effects in the stochastic contextual bandit setting. However, the hardness of many real-world RL instances lies between the two regimes. We develop a flexible and general method called selective uncertainty propagation for confidence interval construction that adapts to the hardness of the associated distribution shift challenges. We show the benefits of our approach on toy environments and demonstrate the value of these techniques for offline policy learning.
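The core idea, propagating future-step uncertainty only to the degree that distributional shift at a step demands, can be illustrated with a minimal backward-recursion sketch. This is not the paper's algorithm: the function name `selective_backup` and the inputs `r_hat` (per-step value estimates), `r_ci` (per-step confidence half-widths), and `shift` (assumed per-step shift coefficients in [0, 1]) are all hypothetical simplifications introduced here for illustration.

```python
def selective_backup(r_hat, r_ci, shift):
    """Toy backward recursion over a horizon of H steps.

    r_hat[h]: estimated reward contribution at step h
    r_ci[h]:  confidence-interval half-width of that estimate
    shift[h]: assumed distribution-shift coefficient at step h;
              0 behaves like a contextual bandit (no propagation),
              1 propagates future uncertainty in full
    Returns the value estimate and uncertainty half-width at step 0.
    """
    H = len(r_hat)
    V = [0.0] * (H + 1)  # value estimates, V[H] = 0 by convention
    U = [0.0] * (H + 1)  # propagated uncertainty half-widths
    for h in range(H - 1, -1, -1):
        V[h] = r_hat[h] + V[h + 1]
        # Selective propagation: future uncertainty U[h+1] enters
        # only scaled by the step's shift coefficient.
        U[h] = r_ci[h] + shift[h] * U[h + 1]
    return V[0], U[0]
```

With all shift coefficients at 0 the interval width reduces to the first step's own half-width (the bandit regime); with all coefficients at 1 it is the sum over the horizon (worst-case propagation), so intermediate values interpolate between the two regimes the abstract describes.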