🤖 AI Summary
This work addresses instability in policy optimization arising from rare but extreme likelihood-ratio deviations that traditional KL divergence–based trust-region methods fail to constrain effectively. To mitigate this, the authors introduce the Bhattacharyya coefficient into trust-region policy optimization for the first time, enabling tighter control over the tail behavior of likelihood ratios by explicitly constraining the overlap between the old and new policy distributions. Building on this insight, they propose a square-root ratio update mechanism and a penalty term derived from the Hellinger distance, yielding two novel algorithms: BTRPO, which applies a quadratic penalty, and BPPO, which clips the square-root likelihood ratio. Under matched training budgets, both algorithms show improved robustness and aggregate performance over baselines, as measured by the RLiable evaluation protocol.
📝 Abstract
Standard trust-region methods constrain policy updates via Kullback-Leibler (KL) divergence. However, KL controls only an average divergence and does not directly prevent the rare, large likelihood-ratio excursions that destabilize training, precisely the failure mode that motivates heuristics such as PPO's clipping. We propose overlap geometry as an alternative trust region, constraining distributional overlap via the Bhattacharyya coefficient (closely related to the Hellinger/Rényi-1/2 geometry). This objective penalizes separation in the ratio tails, yielding tighter control over likelihood-ratio excursions without relying on total-variation bounds that can be loose in tail regimes. We derive Bhattacharyya-TRPO (BTRPO) and Bhattacharyya-PPO (BPPO), which enforce overlap constraints via square-root ratio updates: BPPO clips the square-root ratio q = sqrt(r), and BTRPO applies a quadratic Hellinger/Bhattacharyya penalty. Empirically, overlap-based updates improve robustness and aggregate performance as measured by RLiable under matched training budgets, suggesting overlap constraints as a practical, principled alternative to KL for stable policy optimization.
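The abstract's two update rules can be sketched concretely. Below is a minimal NumPy illustration of how clipping the square-root ratio (BPPO-style) and applying a quadratic Hellinger-type penalty (BTRPO-style) might look as surrogate objectives; the function names, the clipping width `eps`, and the penalty weight `beta` are illustrative assumptions, not the paper's exact formulation. The penalty uses the identity that, under the old policy, E[(sqrt(r) - 1)^2] = 2(1 - BC), i.e. twice the squared Hellinger distance, where BC is the Bhattacharyya coefficient.

```python
import numpy as np

def bppo_surrogate(logp_new, logp_old, adv, eps=0.1):
    """BPPO-style surrogate: PPO clipping applied to q = sqrt(r) (illustrative)."""
    # Square-root likelihood ratio q = sqrt(pi_new / pi_old)
    q = np.exp(0.5 * (logp_new - logp_old))
    # Clip q rather than the raw ratio r = q**2, tightening tail control
    clipped = np.clip(q, 1.0 - eps, 1.0 + eps)
    # Pessimistic (min) combination, as in PPO's clipped objective
    return np.minimum(q * adv, clipped * adv).mean()

def btrpo_surrogate(logp_new, logp_old, adv, beta=1.0):
    """BTRPO-style surrogate: quadratic Hellinger/Bhattacharyya penalty (illustrative)."""
    q = np.exp(0.5 * (logp_new - logp_old))
    r = q ** 2
    # Sample estimate of E_old[(sqrt(r) - 1)^2] = 2 * H^2 = 2 * (1 - BC)
    hellinger_penalty = np.mean((q - 1.0) ** 2)
    return (r * adv).mean() - beta * hellinger_penalty
```

When the new and old policies coincide (q = 1 everywhere), both surrogates reduce to the plain mean advantage and the penalty vanishes, so the update direction is governed entirely by how far q drifts from 1 in the tails.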