🤖 AI Summary
Deep exploration remains challenging in continuous control tasks with delayed rewards, where sparse feedback impedes effective policy learning. Method: This paper introduces PBAC, the first actor-critic algorithm grounded in PAC-Bayes theory. It features (1) a bootstrapped ensemble of critic networks that quantifies value-function uncertainty; (2) a soft-actor architecture with a shared trunk and critic-specific heads, paired with an ε-greedy head-selection mechanism for principled policy-level exploration; and (3) a data-informed prior over the function space, built from the critics' targets, to support generalization and robustness. Contribution/Results: Evaluated on continuous-control benchmarks of varying difficulty with delayed rewards, PBAC is the only method to consistently discover the delayed rewards, outperforming state-of-the-art deep exploration baselines and bridging theoretical generalization bounds with practical reinforcement learning performance.
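Point (1) above, the bootstrapped critic ensemble, can be sketched in miniature. The snippet below is an illustrative stand-in, not the paper's implementation: it fits a small ensemble of linear value critics, each on a bootstrap resample of (state, Bellman-target) pairs, and reads off the ensemble spread as a value-uncertainty estimate. The linear critics, `n_critics` default, and least-squares fit are all simplifying assumptions.

```python
import numpy as np

def fit_bootstrapped_critics(states, targets, n_critics=5, rng=None):
    """Fit an ensemble of linear value critics, each on a bootstrap
    resample of the (state, Bellman-target) pairs. Toy stand-in for
    the paper's bootstrapped critic networks."""
    rng = np.random.default_rng(rng)
    n, d = states.shape
    X = np.hstack([states, np.ones((n, 1))])   # append a bias column
    critics = []
    for _ in range(n_critics):
        idx = rng.integers(0, n, size=n)       # bootstrap resample with replacement
        w, *_ = np.linalg.lstsq(X[idx], targets[idx], rcond=None)
        critics.append(w)
    return np.stack(critics)                   # shape (n_critics, d + 1)

def value_uncertainty(critics, state):
    """Ensemble mean and standard deviation of the value at one state."""
    x = np.append(state, 1.0)                  # same bias convention as fitting
    preds = critics @ x                        # one prediction per critic
    return preds.mean(), preds.std()
```

States visited often get consistent targets across resamples and hence low spread; rarely visited states disagree across the ensemble, which is the signal deep exploration exploits.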
📝 Abstract
Reinforcement learning for continuous control under delayed rewards is an under-explored problem despite its practical significance. Many complex skills build on intermediate ones as prerequisites: a humanoid locomotor, for instance, has to learn how to stand before it can learn to walk. To cope with delayed rewards, a reinforcement learning agent has to perform deep exploration. However, existing deep exploration methods are designed for small discrete action spaces, and whether they generalize to state-of-the-art continuous control remains unproven. We address the deep exploration problem for the first time from a PAC-Bayesian perspective in the context of actor-critic learning. To do so, we quantify the error of the Bellman operator through a PAC-Bayes bound, where a bootstrapped ensemble of critic networks represents the posterior distribution and their targets serve as a data-informed function-space prior. We derive an objective function from this bound and use it to train the critic ensemble. Each critic trains an individual soft actor network, implemented as a shared trunk with critic-specific heads. The agent performs deep exploration by acting epsilon-greedily on a randomly chosen actor head. Our proposed algorithm, named PAC-Bayesian Actor-Critic (PBAC), is the only algorithm to consistently discover delayed rewards across a diverse set of continuous control tasks of varying difficulty.
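The shared-trunk multi-head actor with epsilon-greedy head selection can be sketched as follows. This is a minimal numpy sketch under stated assumptions, not the paper's network: the trunk and heads are single random linear layers, and "greedy" is read as picking the head whose paired critic scores it highest, with a random head chosen with probability eps; the exact selection rule, architecture sizes, and `critic_scores` interface are assumptions for illustration.

```python
import numpy as np

class MultiHeadActor:
    """Shared linear trunk with one linear head per critic.
    Head selection is epsilon-greedy: with probability eps a head is
    drawn uniformly at random (exploration); otherwise the head with
    the highest paired-critic score is used (exploitation).
    Illustrative sketch only; weights are random and untrained."""

    def __init__(self, state_dim, action_dim, n_heads, rng=None):
        self.rng = np.random.default_rng(rng)
        self.trunk = self.rng.normal(size=(state_dim, 16)) * 0.1       # shared features
        self.heads = self.rng.normal(size=(n_heads, 16, action_dim)) * 0.1

    def act(self, state, critic_scores, eps=0.1):
        h = np.tanh(state @ self.trunk)              # shared trunk features
        if self.rng.random() < eps:
            k = int(self.rng.integers(len(self.heads)))  # random head: deep exploration
        else:
            k = int(np.argmax(critic_scores))            # best-scored head
        return k, np.tanh(h @ self.heads[k])             # bounded continuous action
```

Because each head is trained against a different bootstrapped critic, committing to one randomly chosen head yields temporally consistent exploratory behavior rather than per-step action noise.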