CAE: Repurposing the Critic as an Explorer in Deep Reinforcement Learning

📅 2025-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
In deep reinforcement learning, exploration often lacks both theoretical guarantees and strong empirical performance. To address this, the authors propose CAE, a parameter-free exploration mechanism that repurposes the standard critic network: state-action value uncertainty is modeled as a linear multi-armed bandit and combined with an adaptive scaling strategy, all in roughly 10 lines of code. For complex tasks where learning an effective critic is difficult, a lightweight extension, CAE+, adds an auxiliary module with about 10 further lines of code and less than 1% extra parameters. Theoretically, the method achieves a sublinear regret bound in continuous state spaces; empirically, it outperforms state-of-the-art methods on MuJoCo and MiniHack benchmarks. The approach thus unifies theoretical grounding with engineering simplicity, giving the critic dual functionality for both policy evaluation and uncertainty-aware exploration.
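The core idea, modeling value uncertainty as a linear bandit over critic features, can be sketched with a LinUCB-style bonus. This is an illustrative approximation, not the paper's implementation: here `phi` is assumed to be the critic's penultimate-layer activation, and `beta` stands in for the paper's adaptive scaling strategy.

```python
import numpy as np

class LinUCBBonus:
    """LinUCB-style exploration bonus computed from critic features.

    Hypothetical sketch: phi(s, a) is assumed to be the critic's
    penultimate-layer activation; the paper's exact bandit
    construction and adaptive scaling may differ.
    """

    def __init__(self, feat_dim, beta=1.0, lam=1.0):
        self.beta = beta                     # bonus scale (adaptive in CAE)
        self.A_inv = np.eye(feat_dim) / lam  # inverse feature covariance

    def update(self, phi):
        # Rank-one Sherman-Morrison update of A^{-1} with observed phi
        Av = self.A_inv @ phi
        self.A_inv -= np.outer(Av, Av) / (1.0 + phi @ Av)

    def bonus(self, phi):
        # UCB-style bonus: beta * sqrt(phi^T A^{-1} phi),
        # large for rarely visited feature directions
        return self.beta * np.sqrt(phi @ self.A_inv @ phi)
```

At action-selection time such a bonus would be added to the critic's Q-value, so directions of feature space the critic has rarely seen receive an optimism boost that shrinks as they are visited.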

📝 Abstract
Exploration remains a critical challenge in reinforcement learning, as many existing methods either lack theoretical guarantees or fall short of practical effectiveness. In this paper, we introduce CAE, a lightweight algorithm that repurposes the value networks in standard deep RL algorithms to drive exploration without introducing additional parameters. CAE utilizes any linear multi-armed bandit technique and incorporates an appropriate scaling strategy, enabling efficient exploration with provable sub-linear regret bounds and practical stability. Notably, it is simple to implement, requiring only around 10 lines of code. In complex tasks where learning an effective value network proves challenging, we propose CAE+, an extension of CAE that incorporates an auxiliary network. This extension increases the parameter count by less than 1% while maintaining implementation simplicity, adding only about 10 additional lines of code. Experiments on MuJoCo and MiniHack show that both CAE and CAE+ outperform state-of-the-art baselines, bridging the gap between theoretical rigor and practical efficiency.
Problem

Research questions and friction points this paper is trying to address.

Addressing exploration challenges in reinforcement learning
Repurposing value networks for efficient exploration
Achieving theoretical and practical exploration efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposes value networks for exploration
Uses linear bandit with scaling strategy
Extends with auxiliary network (CAE+)
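The CAE+ extension might look something like the following sketch, which is an assumption about the design rather than the paper's actual architecture: a single small projection head on top of the critic's trunk that produces features for the bandit bonus, keeping the parameter overhead tiny.

```python
import numpy as np

class AuxFeatureHead:
    """Illustrative CAE+-style auxiliary head (hypothetical; the
    paper's exact module may differ). Projects the critic trunk's
    hidden activation to a small feature space used only for the
    exploration bonus, adding a single d_hidden x d_aux weight
    matrix -- a small d_aux keeps the overhead well under the size
    of a typical actor-critic network.
    """

    def __init__(self, d_hidden, d_aux, seed=0):
        rng = np.random.default_rng(seed)
        # Scaled Gaussian init keeps output variance roughly unit
        self.W = rng.normal(scale=1.0 / np.sqrt(d_hidden),
                            size=(d_hidden, d_aux))

    def __call__(self, h):
        # tanh bounds the feature norm, stabilizing the bonus scale
        return np.tanh(h @ self.W)
```

The bounded output is a deliberate choice in this sketch: with unbounded features, the bandit bonus's magnitude would drift with the critic's activation scale during training.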