🤖 AI Summary
Deep exploration remains challenging in continuous control tasks with delayed rewards, where sparse feedback impedes effective policy learning. Method: This paper introduces PBAC, the first actor-critic algorithm grounded in PAC-Bayes theory. It features (1) a bootstrapped ensemble of critic networks that quantifies value-function uncertainty; (2) a soft-actor architecture with a shared trunk and critic-specific heads, paired with an ε-greedy head-selection mechanism for principled policy-level exploration; and (3) a data-informed prior over the function space, built from the critics' targets, to support generalization and robustness. Contribution/Results: Evaluated on continuous-control benchmarks of varying difficulty with delayed rewards, PBAC is the only method to consistently discover the delayed rewards, outperforming state-of-the-art deep exploration baselines and bridging theoretical generalization bounds with practical reinforcement learning performance.
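Point (1) above, the bootstrapped critic ensemble, can be sketched in miniature. The snippet below is an illustrative stand-in, not the paper's implementation: it fits a small ensemble of linear value critics, each on a bootstrap resample of (state, Bellman-target) pairs, and reads off the ensemble spread as a value-uncertainty estimate. The linear critics, `n_critics` default, and least-squares fit are all simplifying assumptions.

```python
import numpy as np

def fit_bootstrapped_critics(states, targets, n_critics=5, rng=None):
    """Fit an ensemble of linear value critics, each on a bootstrap
    resample of the (state, Bellman-target) pairs. Toy stand-in for
    the paper's bootstrapped critic networks."""
    rng = np.random.default_rng(rng)
    n, d = states.shape
    X = np.hstack([states, np.ones((n, 1))])   # append a bias column
    critics = []
    for _ in range(n_critics):
        idx = rng.integers(0, n, size=n)       # bootstrap resample with replacement
        w, *_ = np.linalg.lstsq(X[idx], targets[idx], rcond=None)
        critics.append(w)
    return np.stack(critics)                   # shape (n_critics, d + 1)

def value_uncertainty(critics, state):
    """Ensemble mean and standard deviation of the value at one state."""
    x = np.append(state, 1.0)                  # same bias convention as fitting
    preds = critics @ x                        # one prediction per critic
    return preds.mean(), preds.std()
```

States visited often get consistent targets across resamples and hence low spread; rarely visited states disagree across the ensemble, which is the signal deep exploration exploits.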
📝 Abstract
Reinforcement learning for continuous control under delayed rewards is an under-explored problem despite its practical significance. Many complex skills build on intermediate ones as prerequisites: a humanoid locomotor, for instance, has to learn how to stand before it can learn to walk. To cope with delayed rewards, a reinforcement learning agent has to perform deep exploration. However, existing deep exploration methods are designed for small discrete action spaces, and whether they generalize to state-of-the-art continuous control remains unproven. We address the deep exploration problem for the first time from a PAC-Bayesian perspective in the context of actor-critic learning. To do so, we quantify the error of the Bellman operator through a PAC-Bayes bound, where a bootstrapped ensemble of critic networks represents the posterior distribution and their targets serve as a data-informed function-space prior. We derive an objective function from this bound and use it to train the critic ensemble. Each critic trains an individual soft actor network, implemented as a shared trunk with critic-specific heads. The agent performs deep exploration by acting epsilon-greedily on a randomly chosen actor head. Our proposed algorithm, named PAC-Bayesian Actor-Critic (PBAC), is the only algorithm to consistently discover delayed rewards across a diverse set of continuous control tasks of varying difficulty.
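The shared-trunk multi-head actor with epsilon-greedy head selection can be sketched as follows. This is a minimal numpy sketch under stated assumptions, not the paper's network: the trunk and heads are single random linear layers, and "greedy" is read as picking the head whose paired critic scores it highest, with a random head chosen with probability eps; the exact selection rule, architecture sizes, and `critic_scores` interface are assumptions for illustration.

```python
import numpy as np

class MultiHeadActor:
    """Shared linear trunk with one linear head per critic.
    Head selection is epsilon-greedy: with probability eps a head is
    drawn uniformly at random (exploration); otherwise the head with
    the highest paired-critic score is used (exploitation).
    Illustrative sketch only; weights are random and untrained."""

    def __init__(self, state_dim, action_dim, n_heads, rng=None):
        self.rng = np.random.default_rng(rng)
        self.trunk = self.rng.normal(size=(state_dim, 16)) * 0.1       # shared features
        self.heads = self.rng.normal(size=(n_heads, 16, action_dim)) * 0.1

    def act(self, state, critic_scores, eps=0.1):
        h = np.tanh(state @ self.trunk)              # shared trunk features
        if self.rng.random() < eps:
            k = int(self.rng.integers(len(self.heads)))  # random head: deep exploration
        else:
            k = int(np.argmax(critic_scores))            # best-scored head
        return k, np.tanh(h @ self.heads[k])             # bounded continuous action
```

Because each head is trained against a different bootstrapped critic, committing to one randomly chosen head yields temporally consistent exploratory behavior rather than per-step action noise.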