🤖 AI Summary
This work investigates whether computationally efficient reinforcement learning algorithms exist for Markov decision processes (MDPs) with deterministic dynamics, large action spaces, stochastic initial states, and stochastic rewards, under the linear Bellman completeness framework. To address error amplification, a key challenge in value estimation, the authors propose the first computationally efficient (polynomial-time) optimistic value iteration algorithm for this setting: it injects structured random noise *only* into the null space of the training data during least-squares regression, yielding strictly optimistic value estimates without excessive conservatism. The method combines linear function approximation with optimistic value iteration and achieves a regret bound of $\tilde{O}(\sqrt{d^3 H^3 T})$, where $d$ is the feature dimension, $H$ the horizon, and $T$ the total number of time steps. The linear Bellman completeness setting unifies classical models, including linear MDPs and linear quadratic regulators (LQR), and this result removes a computational bottleneck in a large-action-space regime previously known only to be statistically tractable.
📝 Abstract
We study computationally and statistically efficient reinforcement learning algorithms for the linear Bellman complete setting. This setting uses linear function approximation to capture value functions and unifies existing models such as linear Markov Decision Processes (MDPs) and Linear Quadratic Regulators (LQRs). While prior work has shown that this setting is statistically tractable, it remained open whether a computationally efficient algorithm exists. Our work provides a computationally efficient algorithm for the linear Bellman complete setting that works for MDPs with large action spaces, random initial states, and random rewards, but relies on the underlying dynamics being deterministic. Our approach is based on randomization: we inject random noise into least-squares regression problems to perform optimistic value iteration. Our key technical contribution is to carefully design the noise so that it acts only in the null space of the training data, ensuring optimism while circumventing a subtle error-amplification issue.
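The core mechanism, perturbing a least-squares estimate only in directions the training data does not constrain, can be sketched in NumPy. This is a minimal illustration on hypothetical toy data, not the paper's algorithm: the noise scale `sigma` and the Gaussian noise distribution are placeholder assumptions, whereas the paper designs the noise carefully to guarantee optimism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy regression data: n feature vectors in d dimensions.
# The features are deliberately rank-deficient (rank 5 < d = 8), so the
# training data leaves some directions of the parameter space unconstrained.
n, d = 20, 8
Phi = rng.normal(size=(n, 5)) @ rng.normal(size=(5, d))
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Ordinary least-squares fit (minimum-norm solution).
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Null space of the training data: right singular vectors of Phi whose
# singular values are (numerically) zero.
_, s, Vt = np.linalg.svd(Phi, full_matrices=True)
rank = int(np.sum(s > 1e-10))
N = Vt[rank:].T  # columns span null(Phi)

# Inject random noise ONLY along the null-space directions.
sigma = 1.0  # placeholder scale; the paper calibrates this to ensure optimism
theta_noisy = theta_hat + N @ (sigma * rng.normal(size=N.shape[1]))

# Predictions on the training data are unchanged by the perturbation,
# so no regression error is amplified on seen directions ...
assert np.allclose(Phi @ theta_noisy, Phi @ theta_hat)
# ... while the estimate does move in directions the data does not pin down,
# which is where optimism is needed.
assert not np.allclose(theta_noisy, theta_hat)
```

The key property shown by the assertions is that the perturbation is invisible on the training inputs (it lies in `null(Phi)`) yet changes the value estimate on unseen feature directions, which is how the method obtains optimism without the error amplification that naive isotropic noise would cause.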