🤖 AI Summary
Entropy maximization in policy gradient methods promotes exploration but risks collapsing the policy toward a uniform random distribution, sacrificing structure and sample efficiency.
Method: We propose a complexity-driven policy optimization framework that replaces conventional entropy regularization with a novel complexity reward—defined as the product of Shannon entropy and disequilibrium—to dynamically balance randomness and structure. Integrated into the PPO algorithm, our approach penalizes both excessive disorder and rigid determinism, thereby encouraging agents to autonomously discover adaptive, nontrivial behavioral patterns.
Contribution/Results: Empirical evaluation across diverse exploration-intensive benchmarks demonstrates that the method is more robust to the choice of its complexity coefficient than standard PPO is to its entropy coefficient, yielding improved training stability and higher asymptotic performance. The method mitigates policy degeneration while preserving exploratory capacity, advancing the design of structured, efficient reinforcement learning policies.
📝 Abstract
Policy gradient methods often balance exploitation and exploration via entropy maximization. However, maximizing entropy pushes the policy towards a uniform random distribution, which represents an unstructured and sometimes inefficient exploration strategy. In this work, we propose replacing the entropy bonus with a more robust complexity bonus. In particular, we adopt a measure of complexity, defined as the product of Shannon entropy and disequilibrium, where the latter quantifies the distance from the uniform distribution. This regularizer encourages policies that balance stochasticity (high entropy) with structure (high disequilibrium), guiding agents toward regimes where useful, non-trivial behaviors can emerge. Such behaviors arise because the regularizer suppresses both extremes, namely maximal disorder and complete order, creating pressure for agents to discover structured yet adaptable strategies. Starting from Proximal Policy Optimization (PPO), we introduce Complexity-Driven Policy Optimization (CDPO), a new learning algorithm that replaces entropy with complexity. We show empirically across a range of discrete-action-space tasks that CDPO is more robust to the choice of the complexity coefficient than PPO is to the entropy coefficient, especially in environments requiring greater exploration.
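To make the regularizer concrete, here is a minimal sketch of a complexity bonus for a discrete policy, assuming the disequilibrium term is the squared Euclidean distance from the uniform distribution (the standard LMC-style form); the paper's exact normalization and the coefficient name `complexity_coef` are assumptions, not taken from the source.

```python
import math

def complexity(probs):
    """Complexity of a discrete distribution: Shannon entropy times
    disequilibrium.

    Disequilibrium is assumed here to be the squared Euclidean distance
    from the uniform distribution; the paper's normalization may differ.
    The product vanishes at both extremes: a uniform policy has zero
    disequilibrium, and a deterministic policy has zero entropy.
    """
    n = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    disequilibrium = sum((p - 1.0 / n) ** 2 for p in probs)
    return entropy * disequilibrium

def cdpo_bonus(probs, complexity_coef=0.01):
    # Hypothetical drop-in replacement for PPO's entropy bonus:
    # total_loss = clipped_surrogate_loss - cdpo_bonus(action_probs)
    # instead of ... - entropy_coef * entropy(action_probs).
    return complexity_coef * complexity(probs)
```

Because both a uniform and a deterministic action distribution yield zero bonus, maximizing this term pulls the policy toward intermediate, structured-but-stochastic distributions, which is the behavior the abstract describes.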