🤖 AI Summary
This work addresses a key limitation of existing efficient reinforcement learning algorithms for linearly Bellman-complete Markov decision processes (MDPs), which typically require either small action spaces or strong oracle assumptions over the feature space. Focusing on such MDPs with deterministic transitions, stochastic initial states, and stochastic rewards, the paper proposes the first end-to-end provably efficient algorithm that operates under only a standard argmax action oracle and accommodates both finite and infinite action spaces. Under linear function approximation and the linear Bellman completeness assumption, the method learns an $\varepsilon$-optimal policy with sample and computational complexity polynomial in the horizon, feature dimension, and $1/\varepsilon$.
📝 Abstract
We study reinforcement learning (RL) with linear function approximation in Markov Decision Processes (MDPs) satisfying \emph{linear Bellman completeness} -- a fundamental setting in which the Bellman backup of any linear value function remains linear. While this setting is statistically tractable, prior computationally efficient algorithms are either limited to small action spaces or require strong oracle assumptions over the feature space. We provide a computationally efficient algorithm for linear Bellman complete MDPs with \emph{deterministic transitions}, stochastic initial states, and stochastic rewards. For finite action spaces, our algorithm is end-to-end efficient; for large or infinite action spaces, we require only a standard argmax oracle over actions. Our algorithm learns an $\varepsilon$-optimal policy with sample and computational complexity polynomial in the horizon, feature dimension, and $1/\varepsilon$.
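For concreteness, the linear Bellman completeness condition referenced above can be sketched as follows. This is a standard formulation of the condition, not necessarily the paper's exact notation: $\phi$ denotes the feature map, $h$ the time step, and $\theta, w$ are illustrative parameter vectors.

```latex
% Linear Bellman completeness (standard formulation, sketch):
% the Bellman backup of any linear value function is again linear
% in the features. For every step h and every parameter \theta,
% there exists a parameter w such that
\forall \theta \in \mathbb{R}^d,\;\; \exists\, w \in \mathbb{R}^d
\;\text{ s.t. }\;
\phi(s,a)^\top w
\;=\;
r(s,a)
\;+\;
\mathbb{E}_{s' \sim P(\cdot \mid s,a)}
\Big[\max_{a'} \phi(s',a')^\top \theta\Big]
\quad \forall (s,a).
```

In the deterministic-transition setting studied here, the expectation over $s'$ collapses to a single successor state, which is what the algorithm exploits for computational efficiency.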