AI Summary
Standard softmax policy gradient can be exponentially slow in its transient phase near sub-optimal corners (vertices) of the simplex. The cause is self-trapping: actions with negative advantage reinforce the corner policy. This work studies Delightful Policy Gradient (DG), which gates each policy-gradient term by the product of advantage and action surprisal. For \(K\)-armed bandits, the zero-temperature limit of DG removes the corner-trapping mechanism near any sub-optimal corner and yields a first-exit escape bound that is logarithmic in the initial probability ratio. A key structural result is that every action better than the corner action is an "ally": its contribution to escape is non-negative. On the theoretical side, DG is proved to converge globally to the optimal policy at an asymptotic \(O(1/t)\) rate in both \(K\)-armed bandits and tabular MDPs. Empirically, in MNIST contextual bandit experiments with a shared-parameter neural network, DG recovers from bad initializations faster than standard policy gradient.
Abstract
Softmax policy gradient converges at $O(1/t)$, but its transient behavior near sub-optimal corners of the simplex can be exponentially slow. The bottleneck is self-trapping: negative-advantage actions reinforce the corner policy and can initially push the optimal action backward. We study \emph{Delightful Policy Gradient} (DG), which gates each policy-gradient term by the product of advantage and action surprisal. For $K$-armed bandits, we prove that the zero-temperature limit of DG removes this corner-trapping mechanism on a quantitative sector near any sub-optimal corner, yielding a first-exit escape bound logarithmic in the initial probability ratio. At every fixed temperature, the same local mechanism persists because harmful actions are polynomially suppressed as they become rare. A key structural insight is that every action better than the corner action is an \emph{ally}: its contribution to escape is non-negative. Combining corner instability with a monotonic value improvement identity, we prove that DG converges globally to the optimal policy in both bandits and tabular MDPs at an asymptotic $O(1/t)$ rate. We also show, via an exact counterexample, that this tabular mechanism can fail under shared function approximation. In MNIST contextual bandits with a shared-parameter neural network, DG nevertheless recovers from bad initializations faster than standard policy gradient, suggesting that the counterexample marks a boundary of the theory rather than a practical prohibition.
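The self-trapping effect and the gating idea described in the abstract can be illustrated on a three-armed bandit. The sketch below is a minimal illustration, not the authors' implementation. It assumes the gate is a sigmoid of (advantage × surprisal) / temperature; the abstract states only that each term is gated by the product of advantage and action surprisal, so the sigmoid form, the function names, the rewards, and the starting logits are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    shifted = logits - np.max(logits)
    weights = np.exp(shifted)
    return weights / weights.sum()

def sigmoid(x):
    # Stable logistic function: 1 / (1 + exp(-x)).
    return np.exp(-np.logaddexp(0.0, -x))

def expected_update(logits, rewards, temperature=None):
    """Expected logit update for a K-armed bandit with a softmax policy.

    temperature=None gives the standard policy gradient.
    Otherwise each term is multiplied by an ASSUMED gate,
    sigmoid(advantage * surprisal / temperature).
    """
    pi = softmax(logits)
    advantage = rewards - pi @ rewards          # A(a) = r(a) - V
    if temperature is None:
        gate = np.ones_like(pi)
    else:
        surprisal = -np.log(pi)                 # surprisal of each action
        gate = sigmoid(advantage * surprisal / temperature)
    # sum_a pi(a) * gate(a) * A(a) * grad log pi(a),
    # where grad log pi(a) = e_a - pi for softmax logits.
    weights = pi * gate * advantage
    return weights - pi * weights.sum()

# Hypothetical setup: action 0 is optimal, action 1 is the sub-optimal corner,
# action 2 is worse than the corner and more likely than the optimal action.
rewards = np.array([1.0, 0.5, 0.0])
logits = np.array([-8.0, 0.0, -4.0])

pg = expected_update(logits, rewards)
dg = expected_update(logits, rewards, temperature=0.1)

# Under gradient flow, d/dt log(pi_0 / pi_1) equals update[0] - update[1].
print("standard PG drift of log(pi_0/pi_1):", pg[0] - pg[1])
print("gated (DG-style) drift of log(pi_0/pi_1):", dg[0] - dg[1])
```

With these assumed numbers, the standard update moves the log-ratio of the optimal action to the corner action in the negative direction, because the negative-advantage action makes the corner action look better than average. The gated update moves it in the positive direction, because the rare harmful action is almost entirely suppressed.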