PrefPoE: Advantage-Guided Preference Fusion for Learning Where to Explore

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning, conventional entropy-maximization exploration often induces high policy-update variance and low sample efficiency, hindering the balance between exploration and exploitation. To address this, we propose PrefPoE—a novel framework that introduces the Product-of-Experts (PoE) model to single-task RL for the first time. PrefPoE establishes an advantage-guided preference ensembling mechanism: it adaptively fuses a preference network with the primary policy via a soft trust region, enabling stable exploration while mitigating entropy collapse. Built upon policy gradients, the method jointly optimizes advantage estimation, preference network training, and dynamic PoE fusion, thereby enhancing update stability and sample efficiency. Empirical evaluation on standard continuous-control benchmarks demonstrates substantial improvements over baselines: +321% on HalfCheetah-v4, +69% on Ant-v4, and +276% on LunarLander-v2—validating both effectiveness and generalizability.
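The fusion step described above can be illustrated for the common case of diagonal-Gaussian policies: the product of two Gaussian experts is itself Gaussian, with combined precision and a precision-weighted mean, so the fused policy is pulled toward whichever expert is more confident. The sketch below is a minimal illustration of this generic PoE property, not the paper's actual implementation; the function name and interface are hypothetical.

```python
import numpy as np

def poe_gaussian(mu_policy, sigma_policy, mu_pref, sigma_pref):
    """Fuse two diagonal Gaussian experts via product-of-experts.

    The product of two Gaussian densities is (up to normalization) a
    Gaussian whose precision is the sum of the experts' precisions and
    whose mean is the precision-weighted average of their means.
    """
    tau_policy = 1.0 / sigma_policy**2          # precision of main policy
    tau_pref = 1.0 / sigma_pref**2              # precision of preference net
    tau = tau_policy + tau_pref                 # combined precision
    mu = (tau_policy * mu_policy + tau_pref * mu_pref) / tau
    sigma = np.sqrt(1.0 / tau)
    return mu, sigma
```

Because precisions add, the fused distribution is always narrower than either expert alone, which is one intuition behind the "soft trust region" effect: the preference expert can shift the sampling distribution without letting it drift arbitrarily far from the main policy.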

📝 Abstract
Exploration in reinforcement learning remains a critical challenge, as naive entropy maximization often results in high variance and inefficient policy updates. We introduce PrefPoE, a novel Preference-Product-of-Experts framework that performs intelligent, advantage-guided exploration via the first principled application of product-of-experts (PoE) fusion for single-task exploration-exploitation balancing. By training a preference network to concentrate probability mass on high-advantage actions and fusing it with the main policy through PoE, PrefPoE creates a soft trust region that stabilizes policy updates while maintaining targeted exploration. Across diverse control tasks spanning both continuous and discrete action spaces, PrefPoE demonstrates consistent improvements: +321% on HalfCheetah-v4 (1276 → 5375), +69% on Ant-v4, +276% on LunarLander-v2, with consistently enhanced training stability and sample efficiency. Unlike standard PPO, which suffers from entropy collapse, PrefPoE sustains adaptive exploration through its unique dynamics, thereby preventing premature convergence and enabling superior performance. Our results establish that learning where to explore through advantage-guided preferences is as crucial as learning how to act, offering a general framework for enhancing policy gradient methods across the full spectrum of reinforcement learning domains. Code and pretrained models are available in supplementary materials.
Problem

Research questions and friction points this paper is trying to address.

Improves exploration efficiency in reinforcement learning
Stabilizes policy updates while maintaining targeted exploration
Prevents premature convergence through advantage-guided preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage-guided preference fusion for exploration
Product-of-experts framework stabilizes policy updates
Soft trust region maintains targeted exploration efficiency
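The "advantage-guided preference" idea can be sketched with a standard advantage-weighted objective: weight each action's log-probability under the preference network by a softmax over its estimated advantage, so minimizing the loss concentrates probability mass on high-advantage actions. This is a generic sketch under that assumption; `preference_loss`, its signature, and the temperature `beta` are illustrative, not the paper's actual training objective.

```python
import numpy as np

def preference_loss(logp_pref, advantages, beta=1.0):
    """Advantage-weighted negative log-likelihood (illustrative sketch).

    logp_pref:  log-probabilities of sampled actions under the
                preference network.
    advantages: advantage estimates for those same actions.
    beta:       temperature; smaller beta focuses the weights more
                sharply on the highest-advantage actions.
    """
    w = np.exp(advantages / beta)
    w /= w.sum()                      # softmax weights over the batch
    return -(w * logp_pref).sum()     # weighted negative log-likelihood
```

With uniform advantages the weights are uniform and the loss reduces to the ordinary mean negative log-likelihood; as advantages spread out, gradient signal shifts toward the best-performing actions.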