HAEPO: History-Aggregated Exploratory Policy Optimization

📅 2025-08-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing policy optimization methods (e.g., DPO, GRPO) exhibit limited exploration on long-horizon tasks. This paper proposes History-Aggregated Exploratory Policy Optimization (HAEPO), which explicitly incorporates full trajectory histories into policy updates. HAEPO compresses each trajectory into its cumulative log-probability and applies a Plackett–Luce softmax across trajectories, combined with entropy regularization and a soft KL penalty relative to a frozen reference policy; together these encourage broad, long-horizon exploration while preserving training stability. Its core contributions are a normalized, return-proportional trajectory weighting scheme and an explicitly history-aware, trajectory-level update. Experiments show that HAEPO converges quickly, explores thoroughly, and aligns closely with true rewards across diverse tasks, performing better than or on par with PPO, GRPO, and DPO.
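The objective itself is not reproduced on this page; one plausible form, reconstructed from the summary and abstract above (the temperature β and the coefficients λ_H and λ_KL are assumptions, not notation from the paper), is:

$$
\log \pi_\theta(\tau_i) \;=\; \sum_{t} \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right),
\qquad
w_i \;=\; \frac{\exp\!\left(R(\tau_i)/\beta\right)}{\sum_{j} \exp\!\left(R(\tau_j)/\beta\right)},
$$

$$
\mathcal{L}(\theta) \;=\; -\sum_{i} w_i \,\log \pi_\theta(\tau_i)
\;-\; \lambda_H\, \mathcal{H}\!\left[\pi_\theta\right]
\;+\; \lambda_{\mathrm{KL}}\, \mathrm{KL}\!\left(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right),
$$

i.e., a return-weighted cumulative log-likelihood plus an entropy bonus and a soft KL penalty to the frozen reference policy.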

📝 Abstract
Exploration is essential in modern learning, from reinforcement learning environments with small neural policies to large language models (LLMs). Existing work, such as DPO, leverages full sequence log-likelihoods to capture an entire trajectory of the model's decisions, while methods like GRPO aggregate per-token ratios into a trajectory-level update. However, both often limit exploration on long-horizon tasks. We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss to combat these shortcomings. HAEPO compresses each trajectory into the sum of its logarithmic probabilities (a cumulative logarithmic likelihood) and applies a Plackett-Luce softmax across trajectories to obtain normalized weights proportional to their returns, thus encouraging broader exploration. We add entropy regularization to stabilize aggressive updates and prevent premature collapse, and a soft KL penalty relative to a frozen copy of the previous (reference) policy. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and demonstrates robust learning behavior better than or on par with PPO, GRPO, and DPO across diverse tasks. Thus, HAEPO provides a stable and interpretable framework by explicitly leveraging full-trajectory history while balancing exploration and stability.
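As a concrete illustration of the mechanics described in the abstract, below is a minimal PyTorch sketch of a HAEPO-style trajectory-weighted loss. It assumes precomputed per-trajectory quantities; the function name, argument layout, and coefficient defaults are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def haepo_style_loss(logp_traj, logp_traj_ref, returns, token_entropy,
                     beta=1.0, ent_coef=0.01, kl_coef=0.1):
    """Sketch of a HAEPO-style trajectory-weighted loss (hypothetical API).

    logp_traj     : (B,) cumulative log-probability of each trajectory under the current policy
    logp_traj_ref : (B,) the same sums under the frozen reference policy
    returns       : (B,) scalar return of each trajectory (carries no gradient)
    token_entropy : (B,) mean per-token policy entropy over each trajectory
    """
    # Plackett-Luce-style softmax over returns: normalized trajectory weights.
    # Higher-return trajectories get more weight, but none is zeroed out.
    weights = F.softmax(returns / beta, dim=0)

    # Weighted cumulative log-likelihood: raise the probability of
    # high-return trajectories while keeping every trajectory in play.
    policy_term = -(weights * logp_traj).sum()

    # Entropy bonus to discourage premature collapse of the policy.
    entropy_term = -ent_coef * token_entropy.mean()

    # Soft KL penalty toward the frozen reference policy; here a simple
    # sequence-level surrogate E[log pi - log pi_ref] under sampled trajectories.
    kl_term = kl_coef * (logp_traj - logp_traj_ref.detach()).mean()

    return policy_term + entropy_term + kl_term
```

In practice `logp_traj` would be the sum of per-token log-probabilities produced by the policy over a sampled trajectory, and the three terms trade off exploitation of high-return trajectories, exploration, and proximity to the reference model.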
Problem

Research questions and friction points this paper is trying to address.

Addresses limited exploration in long-horizon reinforcement learning tasks
Improves policy optimization using history-aware exploratory loss
Balances exploration and stability with entropy and KL regularization
Innovation

Methods, ideas, or system contributions that make the work stand out.

History-aware exploratory loss for broader exploration
Plackett-Luce softmax across trajectories for normalized weights (see the toy example after this list)
Entropy regularization and soft KL penalty for stability
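To make the exploration/stability trade-off concrete, here is a toy illustration of the return-softmax weighting. The temperature `beta` is an assumed knob and the return values are made up; they are not taken from the paper.

```python
import torch
import torch.nn.functional as F

# The return softmax keeps every trajectory's weight strictly positive,
# so low-return trajectories still receive gradient signal.
returns = torch.tensor([2.0, 1.0, 0.5, -0.2])
for beta in (0.5, 1.0, 4.0):
    weights = F.softmax(returns / beta, dim=0)
    print(f"beta={beta}:", [round(w, 3) for w in weights.tolist()])
# Smaller beta concentrates weight on the best trajectory (exploitation);
# larger beta flattens the weights (broader exploration).
```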
Gaurish Trivedi
Birla Institute of Technology and Science, Pilani
Alakh Sharma
Birla Institute of Technology and Science, Pilani
Kartikey Singh Bhandari
Birla Institute of Technology and Science, Pilani
Dhruv Kumar
Birla Institute of Technology and Science, Pilani
Pratik Narang
Birla Institute of Technology and Science, Pilani
Jagat Sesh Challa
Assistant Professor, Department of Computer Science & Information Systems, BITS Pilani
Big Data Analytics · Computer Vision · Federated Learning · Materials Informatics · HCI