🤖 AI Summary
This paper investigates the statistical complexity of offline decision-making and the extent to which offline data can improve online decision-making in reinforcement learning (RL). Focusing on offline stochastic contextual bandits and Markov decision processes (MDPs) with function approximation, it introduces a unified characterization of behavior-policy data coverage and quantifies fundamental performance limits via the pseudo-dimension of the (value) function class. Leveraging tools from statistical learning theory and minimax risk analysis, the work establishes (nearly) minimax-optimal sample complexity bounds. Key contributions: (1) a coverage notion that strictly subsumes all previous definitions of data coverage in the offline decision-making literature; (2) the first pseudo-dimension-based characterization of fundamental performance limits; and (3) a quantitative characterization of the maximal improvement in online decision-making achievable from offline data, with nearly minimax-optimal rates in a wide range of regimes. These results provide a new theoretical benchmark for algorithm design and data-utility evaluation in offline RL.
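For readers less familiar with the complexity measure named above, the following is the standard textbook definition of pseudo-dimension (a well-known fact, not a statement taken from the paper):

```latex
% Pseudo-dimension (Pollard): a scale-sensitive analogue of the VC
% dimension for real-valued function classes F. Quoted here for
% reference; this is the complexity measure the summary refers to.
% Pdim(F) is the largest n such that some inputs x_1..x_n and
% thresholds t_1..t_n are pseudo-shattered by F:
\[
  \mathrm{Pdim}(\mathcal{F})
  = \max\bigl\{\, n : \exists\, x_{1:n},\, t_{1:n} \text{ such that }
    \forall\, \epsilon \in \{0,1\}^n,\ \exists\, f \in \mathcal{F}
    \text{ with } \mathbf{1}\{f(x_i) > t_i\} = \epsilon_i \ \forall i \,\bigr\}.
\]
```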
📝 Abstract
We study the statistical complexity of offline decision-making with function approximation, establishing (near) minimax-optimal rates for stochastic contextual bandits and Markov decision processes. The performance limits are captured by the pseudo-dimension of the (value) function class and a new characterization of the behavior policy that *strictly* subsumes all the previous notions of data coverage in the offline decision-making literature. In addition, we seek to understand the benefits of using offline data in online decision-making and show nearly minimax-optimal rates in a wide range of regimes.
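To make the "minimax-optimal rates" claim concrete, a schematic of the minimax risk typically studied in this setting is sketched below. The notation ($\mu$ for the behavior policy, $\mathcal{P}(\mathcal{F})$ for the class of problem instances realizable by the function class) is illustrative and may differ from the paper's exact formulation:

```latex
% Schematic minimax risk for offline decision-making (illustrative; the
% paper's exact instance class and loss may differ). An algorithm maps
% an offline dataset D of n samples collected by the behavior policy mu
% to a policy pi-hat; the adversary picks the worst instance P that the
% function class admits:
\[
  \mathfrak{M}_n(\mathcal{F}, \mu)
  \;=\; \inf_{\widehat{\pi}} \,\sup_{P \in \mathcal{P}(\mathcal{F})}\,
        \mathbb{E}_{D \sim (P,\mu)^{\otimes n}}
        \Bigl[\, V^{\star}_{P} - V^{\widehat{\pi}(D)}_{P} \,\Bigr].
\]
```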