Directional Ensemble Aggregation for Actor-Critics

📅 2025-07-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address training instability caused by Q-value overestimation in off-policy reinforcement learning for continuous control, this paper proposes an adaptive ensemble aggregation method. Unlike coarse-grained static strategies such as minimum Q-value aggregation, the approach introduces two learnable directional parameters (one modulating critic conservatism, the other guiding actor exploration), together with a data-driven weighting mechanism based on the direction of the Bellman error. Integrated into the actor-critic framework, the method enables end-to-end differentiable, uncertainty-aware dynamic ensemble aggregation. Crucially, the aggregation behavior adapts automatically across training stages, balancing stability and exploration efficiency. Empirical evaluation on standard continuous-control benchmarks (e.g., MuJoCo), in both interactive and sample-efficient regimes, demonstrates substantial improvements over baseline methods, including min-Q aggregation, validating the proposed method's effectiveness, robustness, and generalization capability.

📝 Abstract
Off-policy reinforcement learning in continuous control tasks depends critically on accurate $Q$-value estimates. Conservative aggregation over ensembles, such as taking the minimum, is commonly used to mitigate overestimation bias. However, these static rules are coarse, discard valuable information from the ensemble, and cannot adapt to task-specific needs or different learning regimes. We propose Directional Ensemble Aggregation (DEA), an aggregation method that adaptively combines $Q$-value estimates in actor-critic frameworks. DEA introduces two fully learnable directional parameters: one that modulates critic-side conservatism and another that guides actor-side policy exploration. Both parameters are learned using ensemble disagreement-weighted Bellman errors, which weight each sample solely by the direction of its Bellman error. This directional learning mechanism allows DEA to adjust conservatism and exploration in a data-driven way, adapting aggregation to both uncertainty levels and the phase of training. We evaluate DEA across continuous control benchmarks and learning regimes, from interactive to sample-efficient, and demonstrate its effectiveness over static ensemble strategies.
Problem

Research questions and friction points this paper is trying to address.

Mitigates Q-value overestimation bias in off-policy RL
Adapts ensemble aggregation to task-specific learning needs
Balances critic conservatism and actor exploration dynamically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Q-value aggregation with learnable parameters
Directional learning using ensemble disagreement-weighted errors
Data-driven adjustment of conservatism and exploration
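The idea of a learnable aggregation direction can be illustrated with a minimal sketch. This is not the paper's actual end-to-end differentiable formulation; it is a simplified illustration under assumptions: a single parameter `beta` interpolates between the ensemble mean (optimistic) and minimum (conservative), and is nudged in the direction of the Bellman error, scaled by ensemble disagreement. The function names `aggregate_q` and `update_beta` are hypothetical, chosen for this sketch.

```python
import numpy as np

def aggregate_q(q_values, beta):
    """Blend mean and min of ensemble Q-estimates.

    beta in [0, 1]: 0 -> ensemble mean (optimistic),
    1 -> ensemble min (conservative).
    q_values: array of shape (ensemble_size,).
    """
    return (1.0 - beta) * q_values.mean() + beta * q_values.min()

def update_beta(beta, bellman_error, disagreement, lr=0.01):
    """Directional update: increase conservatism (beta up) when the
    Bellman error suggests overestimation (positive sign), weighted by
    ensemble disagreement; relax toward optimism otherwise."""
    step = lr * disagreement * np.sign(bellman_error)
    return float(np.clip(beta + step, 0.0, 1.0))

# Toy usage: a positive Bellman error (estimate above target)
# pushes the aggregation toward the conservative minimum.
beta = 0.5
q = np.array([1.0, 1.4, 0.8, 1.2])           # ensemble Q-estimates
disagreement = q.std()                        # uncertainty proxy
estimate = aggregate_q(q, beta)               # aggregated Q-value
bellman_error = estimate - 0.9                # target assumed to be 0.9
beta = update_beta(beta, bellman_error, disagreement)
```

In the paper's actual method both directional parameters (critic-side and actor-side) are learned, and the weighting uses only the direction of each sample's Bellman error rather than a fixed hand-set rule.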
Nicklas Werge
Department of Mathematics and Computer Science, University of Southern Denmark
Yi-Shan Wu
University of Southern Denmark
Machine Learning
Bahareh Tasdighi
Department of Mathematics and Computer Science, University of Southern Denmark
Melih Kandemir
Associate Professor of Machine Learning at the University of Southern Denmark
Bayesian Inference · Neural Stochastic Processes · Dynamics Modeling · Reinforcement Learning