🤖 AI Summary
This work addresses instability in policy optimization arising from rare but extreme likelihood-ratio deviations that traditional KL divergence–based trust-region methods fail to constrain effectively. To mitigate this, the authors introduce the Bhattacharyya coefficient into trust-region policy optimization for the first time, enabling tighter control over the tail behavior of likelihood ratios by explicitly constraining the overlap between the old and new policy distributions. Building on this insight, they propose a square-root ratio update mechanism and a penalty term derived from the Hellinger distance, yielding two novel algorithms: BTRPO, which applies a quadratic penalty, and BPPO, which clips the square-root likelihood ratio. Under matched training budgets, both algorithms show improved robustness and aggregate performance over baselines, as measured by the RLiable evaluation protocol.
📝 Abstract
Standard trust-region methods constrain policy updates via Kullback-Leibler (KL) divergence. However, KL controls only an average divergence and does not directly prevent the rare, large likelihood-ratio excursions that destabilize training, precisely the failure mode that motivates heuristics such as PPO's clipping. We propose overlap geometry as an alternative trust region, constraining distributional overlap via the Bhattacharyya coefficient (closely related to the Hellinger/Rényi-1/2 geometry). This objective penalizes separation in the ratio tails, yielding tighter control over likelihood-ratio excursions without relying on total-variation bounds that can be loose in tail regimes. We derive Bhattacharyya-TRPO (BTRPO) and Bhattacharyya-PPO (BPPO), which enforce overlap constraints via square-root ratio updates: BPPO clips the square-root ratio q = sqrt(r), and BTRPO applies a quadratic Hellinger/Bhattacharyya penalty. Empirically, overlap-based updates improve robustness and aggregate performance as measured by RLiable under matched training budgets, suggesting overlap constraints as a practical, principled alternative to KL for stable policy optimization.
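The abstract's two update rules can be sketched concretely. Below is a minimal NumPy illustration of how clipping the square-root ratio (BPPO-style) and applying a quadratic Hellinger-type penalty (BTRPO-style) might look as surrogate objectives; the function names, the clipping width `eps`, and the penalty weight `beta` are illustrative assumptions, not the paper's exact formulation. The penalty uses the identity that, under the old policy, E[(sqrt(r) - 1)^2] = 2(1 - BC), i.e. twice the squared Hellinger distance, where BC is the Bhattacharyya coefficient.

```python
import numpy as np

def bppo_surrogate(logp_new, logp_old, adv, eps=0.1):
    """BPPO-style surrogate: PPO clipping applied to q = sqrt(r) (illustrative)."""
    # Square-root likelihood ratio q = sqrt(pi_new / pi_old)
    q = np.exp(0.5 * (logp_new - logp_old))
    # Clip q rather than the raw ratio r = q**2, tightening tail control
    clipped = np.clip(q, 1.0 - eps, 1.0 + eps)
    # Pessimistic (min) combination, as in PPO's clipped objective
    return np.minimum(q * adv, clipped * adv).mean()

def btrpo_surrogate(logp_new, logp_old, adv, beta=1.0):
    """BTRPO-style surrogate: quadratic Hellinger/Bhattacharyya penalty (illustrative)."""
    q = np.exp(0.5 * (logp_new - logp_old))
    r = q ** 2
    # Sample estimate of E_old[(sqrt(r) - 1)^2] = 2 * H^2 = 2 * (1 - BC)
    hellinger_penalty = np.mean((q - 1.0) ** 2)
    return (r * adv).mean() - beta * hellinger_penalty
```

When the new and old policies coincide (q = 1 everywhere), both surrogates reduce to the plain mean advantage and the penalty vanishes, so the update direction is governed entirely by how far q drifts from 1 in the tails.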