From Generative to Episodic: Sample-Efficient Replicable Reinforcement Learning

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the trade-off between replicability and sample efficiency in reinforcement learning: in low-horizon tabular MDPs, does replicable exploration necessarily incur significantly higher sample complexity than batch learning? The authors give the first sample-efficient replicable RL algorithm that does not require access to a generative model. Leveraging shared internal randomness, state-action coverage arguments, and low-bias estimation, their algorithm achieves sample complexity $\tilde{O}(S^2A)$. They complement this with a matching $\tilde{\Omega}(S^2A)$ lower bound in the generative setting (under the parallel sampling assumption) and an unconditional $\tilde{\Omega}(S^2)$ lower bound in the episodic setting, establishing near-optimality in the state-space size $S$. This closes the gap between the generative and episodic settings, demonstrating that replicable exploration need not compromise sample efficiency.
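To illustrate the flavor of "shared internal randomness," here is a minimal sketch (not the paper's algorithm) of the randomized-rounding trick used for replicable statistical estimation in the line of work started by Impagliazzo et al. (2022): two runs on independent data draws share a random grid offset, so nearby empirical estimates snap to the same output with high probability. The function name and parameters are illustrative, not from the paper.

```python
import random

def replicable_round(estimate, precision, shared_seed):
    """Round an empirical estimate to a randomly offset grid.

    Two runs that share `shared_seed` draw the same grid offset, so
    estimates within << `precision` of each other round to the same
    grid point with high probability -- the core idea behind
    replicable statistical queries (sketch only, not the paper's
    exploration algorithm).
    """
    rng = random.Random(shared_seed)          # shared internal randomness
    offset = rng.uniform(0.0, precision)      # random grid offset
    return round((estimate - offset) / precision) * precision + offset

# Two independent data draws yield slightly different empirical means;
# the shared offset makes the rounded outputs typically coincide.
seed = 1234
out_a = replicable_round(0.4999, 0.05, seed)
out_b = replicable_round(0.5001, 0.05, seed)
```

Determinism is the key property: rerunning with the same seed and the same estimate always reproduces the same output, and the output is never more than one grid step from the input.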

📝 Abstract
The epidemic failure of replicability across empirical science and machine learning has recently motivated the formal study of replicable learning algorithms [Impagliazzo et al. (2022)]. In batch settings where data comes from a fixed i.i.d. source (e.g., hypothesis testing, supervised learning), the design of data-efficient replicable algorithms is now more or less understood. In contrast, there remain significant gaps in our knowledge for control settings like reinforcement learning where an agent must interact directly with a shifting environment. Karbasi et al. show that with access to a generative model of an environment with $S$ states and $A$ actions (the RL 'batch setting'), replicably learning a near-optimal policy costs only $\tilde{O}(S^2A^2)$ samples. On the other hand, the best upper bound without a generative model jumps to $\tilde{O}(S^7 A^7)$ [Eaton et al. (2024)] due to the substantial difficulty of environment exploration. This gap raises a key question in the broader theory of replicability: Is replicable exploration inherently more expensive than batch learning? Is sample-efficient replicable RL even possible? In this work, we (nearly) resolve this problem (for low-horizon tabular MDPs): exploration is not a significant barrier to replicable learning! Our main result is a replicable RL algorithm on $\tilde{O}(S^2A)$ samples, bridging the gap between the generative and episodic settings. We complement this with a matching $\tilde{\Omega}(S^2A)$ lower bound in the generative setting (under the common parallel sampling assumption) and an unconditional lower bound in the episodic setting of $\tilde{\Omega}(S^2)$, showcasing the near-optimality of our algorithm with respect to the state space $S$.
Problem

Research questions and friction points this paper is trying to address.

Bridging sample efficiency gap between generative and episodic RL settings
Exploring if replicable RL is feasible without generative models
Determining cost of replicable exploration versus batch learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replicable RL algorithm using Õ(S²A) samples
Bridges gap between generative and episodic settings
Near-optimal performance in state space S