🤖 AI Summary
This work addresses the challenge of effectively estimating returns under a fixed horizon in non-episodic finite-horizon reinforcement learning, where conventional methods often struggle. To this end, the authors propose a K-step lookahead Q-function combined with a time-varying action-threshold mechanism, which balances planning depth against estimation variance. By adaptively tuning the lookahead horizon K, they design an efficient tabular learning algorithm and provide theoretical guarantees on its convergence and near-optimal regret bounds under finite-sample regimes. Empirical evaluations on benchmark environments—including JumpRiverswim, FrozenLake, and AnyTrading—demonstrate that the proposed method significantly outperforms existing state-of-the-art tabular RL algorithms in terms of cumulative reward, exhibiting superior sample efficiency and policy performance.
📝 Abstract
Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is challenged by the need to estimate returns to a fixed terminal time. Existing infinite-horizon methods, which often rely on discounted contraction, do not naturally account for this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full-horizon return, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: actions are selected only when their estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective and prove fast finite-sample convergence: it achieves minimax-optimal constant regret for $K=1$ and $\mathcal{O}(\max((K-1),C_{K-1})\sqrt{SAT\log(T)})$ regret for any $K \geq 2$. We numerically evaluate the performance of our algorithm under the objective of maximizing reward. Our implementation adaptively increases K over time, balancing lookahead depth against estimation variance. Empirical results demonstrate superior cumulative rewards over state-of-the-art tabular RL methods across synthetic MDPs and RL environments: JumpRiverswim, FrozenLake, and AnyTrading.
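To make the two ideas in the abstract concrete, here is a minimal illustrative sketch of a K-step lookahead Q-function (a truncated finite-horizon backup) combined with threshold-gated action selection. This is not the paper's algorithm; it assumes access to an estimated transition tensor `P` of shape `(S, A, S)` and a reward table `r` of shape `(S, A)`, and the function names are hypothetical:

```python
import numpy as np

def k_step_lookahead_q(P, r, K):
    """K-step lookahead Q-function: truncate planning to the next K steps.

    P: estimated transitions, shape (S, A, S); r: rewards, shape (S, A).
    Runs K backward-induction backups starting from a zero terminal value.
    """
    S, A = r.shape
    V = np.zeros(S)
    Q = np.array(r, dtype=float)
    for _ in range(K):
        Q = r + P @ V          # (S, A, S) @ (S,) -> (S, A): expected next value
        V = Q.max(axis=1)      # greedy value for the next backup
    return Q

def select_action(Q, s, threshold):
    """Act only when the best K-step lookahead value clears the
    (time-varying) threshold; otherwise signal no action (None)."""
    a = int(Q[s].argmax())
    return a if Q[s, a] >= threshold else None
```

On a tiny two-state MDP (action 0 self-loops, action 1 swaps states, reward 1 only for action 0 in state 0), two backups give `Q[0, 0] = 2`, so a threshold of 1.5 admits the greedy action in state 0 but rejects all actions in state 1.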