🤖 AI Summary
In online policy evaluation for reinforcement learning, outliers and heavy-tailed reward distributions severely degrade the accuracy of parameter and value function estimates. Method: We propose the first unified robust online statistical inference framework, introducing Bahadur-type expansions to temporal difference (TD) learning—enabling incrementally updated, asymptotically normal estimators—and integrating robust statistical estimation with online variance adaptation. Contribution/Results: Theoretically, we establish asymptotic normality of the estimator under both heavy-tailed rewards and adversarial contamination. Empirically, our method significantly improves estimation stability and confidence interval coverage in both synthetic and real-world RL tasks; it achieves over 40% higher robustness against interference compared to standard TD methods, providing a new paradigm for robust policy evaluation that is both theoretically grounded and computationally feasible.
📝 Abstract
Reinforcement learning has emerged as one of the prominent topics attracting attention in modern statistical learning, with policy evaluation being a key component. Unlike the traditional machine learning literature on this topic, our work emphasizes statistical inference for the model parameters and value functions of reinforcement learning algorithms. While most existing analyses assume random rewards to follow standard distributions, we embrace the concept of robust statistics in reinforcement learning by simultaneously addressing issues of outlier contamination and heavy-tailed rewards within a unified framework. In this paper, we develop a fully online robust policy evaluation procedure, and establish the Bahadur-type representation of our estimator. Furthermore, we develop an online procedure to efficiently conduct statistical inference based on the asymptotic distribution. This paper connects robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to online policy evaluation. Finally, we validate the efficacy of our algorithm through numerical experiments conducted in simulations and real-world reinforcement learning experiments.