Online Policy Learning via a Self-Normalized Maximal Inequality

📅 2025-10-17

📈 Citations: 0

✨ Influential: 0

career value

213K/year

🤖 AI Summary

In online policy learning, adaptive experimentation induces data dependence, violating the i.i.d. assumption and invalidating classical concentration inequalities—thus undermining standard statistical guarantees. To address this, we develop a novel self-normalized martingale empirical process maximal inequality and construct the first variance-regularized pessimistic policy learning framework applicable to general dependent data. Crucially, we introduce adaptive sample-variance penalization into pessimistic optimization—enabling fast convergence under sequential updates. Theoretically, we establish an excess risk bound that strictly improves upon the standard $1/sqrt{n}$ rate in both parametric and nonparametric settings. Numerical experiments confirm substantial gains in both convergence speed and policy performance over existing methods.

Technology Category

Application Category

📝 Abstract

Adaptive experiments produce dependent data that break i.i.d. assumptions that underlie classical concentration bounds and invalidate standard learning guarantees. In this paper, we develop a self-normalized maximal inequality for martingale empirical processes. Building on this, we first propose an adaptive sample-variance penalization procedure which balances empirical loss and sample variance, valid for general dependent data. Next, this allows us to derive a new variance-regularized pessimistic off-policy learning objective, for which we establish excess-risk guarantees. Subsequently, we show that, when combined with sequential updates and under standard complexity and margin conditions, the resulting estimator achieves fast convergence rates in both parametric and nonparametric regimes, improving over the usual $1/sqrt{n}$ baseline. We complement our theoretical findings with numerical simulations that illustrate the practical gains of our approach.

Problem

Research questions and friction points this paper is trying to address.

Develops self-normalized maximal inequality for martingale empirical processes

Proposes adaptive variance penalization for general dependent data

Derives variance-regularized off-policy learning with improved convergence rates

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-normalized maximal inequality for martingale processes

Adaptive sample-variance penalization for dependent data

Variance-regularized pessimistic off-policy learning objective

🔎 Similar Papers

Efficient Policy Evaluation with Safety Constraint for Reinforcement Learning