Tight Sample Complexity Bounds for Entropic Best Policy Identification

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses best-policy identification in finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. It analyzes a forward-model-based algorithm that uses KL-divergence-based exploration bonuses adapted to the entropic criterion, and introduces two technical contributions: first, it leverages the smoothness of the exponential utility function to derive sharper concentration bounds; second, it designs a new stopping rule that exploits this tighter bound. Together, these reduce the exponential horizon dependence of the sample complexity upper bound from $O(e^{2|\beta|H})$ to $O(e^{|\beta|H})$, matching the known lower bound $\Omega(e^{|\beta|H})$ and closing the open gap between the two.

📝 Abstract

We study best-policy identification for finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Recent work established a constant gap in the exponential horizon dependence between lower and upper bounds on the number of samples required to identify an approximately optimal policy. Precisely, known lower bounds scale as $\Omega(e^{|\beta| H})$, where $H$ is the horizon of the MDP, while the state-of-the-art upper bound achieves at best $O(e^{2|\beta| H})$ (arXiv:2506.00286v2) using a generative model. We show that this extra exponential factor can be traced to overly loose concentration control for exponential utilities. To close this open gap, we revisit the analysis of this problem through a forward-model-based algorithm that builds on KL-based exploration bonuses, which we adapt to the entropic criterion. The improvement is due to two main technical innovations. We leverage the smoothness properties of the exponential utility to derive sharper concentration bounds, and we propose a new stopping rule that further exploits this tightness to obtain a sample complexity that matches the lower bound.
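
To make the criterion concrete, here is a minimal sketch of the quantities the abstract refers to; it is not the paper's algorithm. It assumes the common convention $\rho_\beta(X) = \frac{1}{\beta} \log \mathbb{E}[e^{\beta X}]$ for the entropic risk of a return $X$ (the abstract does not state its convention), and the function names `entropic_risk` and `exp_factor_gap` are hypothetical, chosen only for illustration.

```python
import math


def entropic_risk(samples, beta):
    """Empirical entropic risk (1/beta) * log(mean(exp(beta * x))).

    Assumed convention: beta < 0 puts more weight on low returns,
    beta > 0 on high returns. beta == 0 is rejected because the
    formula is only defined in the limit (the plain mean).
    """
    if beta == 0:
        raise ValueError("beta must be nonzero")
    scaled = [beta * x for x in samples]
    # Shift by the maximum before exponentiating (log-sum-exp trick)
    # so that large |beta| * H values do not overflow.
    m = max(scaled)
    log_mean = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean / beta


def exp_factor_gap(beta, horizon):
    """Ratio e^{2|beta|H} / e^{|beta|H} = e^{|beta|H} between the
    exponential factors of the earlier upper bound and the lower bound."""
    return math.exp(abs(beta) * horizon)


if __name__ == "__main__":
    returns = [0.0, 1.0, 2.0, 3.0]  # illustrative returns, not data from the paper
    print(entropic_risk(returns, beta=-1.0))
    print(entropic_risk(returns, beta=1.0))
    print(exp_factor_gap(beta=1.0, horizon=10))
```

If per-step rewards lie in $[0, 1]$ (an assumption made here), the return lies in $[0, H]$ and $e^{\beta X}$ ranges between $1$ and $e^{\beta H}$, which gives some intuition for why factors of order $e^{|\beta| H}$ appear in both bounds. The second function only restates the arithmetic of the gap: removing one factor of $e^{|\beta| H}$ from $O(e^{2|\beta| H})$ yields $O(e^{|\beta| H})$.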

Problem

Research questions and friction points this paper is trying to address.

sample complexity

entropic risk

best-policy identification

risk-sensitive reinforcement learning

finite-horizon MDP

Innovation

Methods, ideas, or system contributions that make the work stand out.

entropic risk

sample complexity

risk-sensitive reinforcement learning