Target Policy Optimization

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the sensitivity of conventional policy gradient methods in reinforcement learning to hyperparameter choices, which often leads to overshooting or undershooting during updates because a single step simultaneously determines both which completions should gain probability mass and how the parameters should move. The paper introduces Target Policy Optimization (TPO), which decouples policy updates into two distinct steps: first constructing an exponentially weighted target distribution $q_i \propto p_i^{\text{old}} \exp(u_i)$ based on returns, and then fitting the policy by minimizing its cross-entropy with this target distribution. This approach substantially reduces reliance on learning rates and optimizer selection. Empirical results demonstrate consistent superiority over baseline methods (including PG, PPO, GRPO, and DG) across diverse settings such as tabular bandits, Transformer-based sequential tasks, and RLVR experiments with billion-parameter language models, particularly in sparse-reward scenarios.
📝 Abstract
In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce \emph{Target Policy Optimization} (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution $q_i \propto p_i^{\,\mathrm{old}} \exp(u_i)$ and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is $p^{\theta} - q$, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
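The two-step update described in the abstract can be sketched in a few lines of numpy. This is a minimal illustration based only on the formulas quoted above ($q_i \propto p_i^{\,\mathrm{old}} \exp(u_i)$ and the logit gradient $p^{\theta} - q$), not the paper's actual implementation; the example scores and group size are made up.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D logit vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def tpo_target(p_old, u):
    # Step 1: exponentially weighted target q_i ∝ p_i^old * exp(u_i),
    # normalized over the sampled group of completions
    w = p_old * np.exp(u)
    return w / w.sum()

def cross_entropy_grad(logits, q):
    # Step 2: gradient of -sum_i q_i * log softmax(logits)_i
    # w.r.t. the logits is softmax(logits) - q, i.e. p^θ - q
    return softmax(logits) - q

# Hypothetical group of 4 scored completions (sparse reward on the first)
logits = np.array([1.0, 0.5, 0.0, -0.5])   # current policy logits
p_old = softmax(logits)                     # p^old
u = np.array([2.0, 0.0, 0.0, 0.0])          # per-completion scores

q = tpo_target(p_old, u)
g = cross_entropy_grad(logits, q)           # descent direction: p^θ - q

# Once the policy matches the target, the gradient vanishes:
g_at_target = cross_entropy_grad(np.log(q), q)
```

Note how the step-size question is pushed entirely into the cross-entropy fit: the target $q$ fixes *where* probability mass should go, and the gradient shrinks to zero as the policy approaches it, which is the decoupling the abstract describes.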
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Policy Optimization
Sparse Reward
Target Distribution
Policy Gradient
Innovation

Methods, ideas, or system contributions that make the work stand out.

Target Policy Optimization
reinforcement learning
sparse reward
policy gradient
cross-entropy fitting