Proximal Policy Optimization in Path Space: A Schrödinger Bridge Perspective

📅 2026-03-23
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This work extends Proximal Policy Optimization (PPO) beyond conventional action-space probability ratios to accommodate trajectory-level generative policies, such as those based on diffusion or flow models. Operating in path space and drawing on the Generalized Schrödinger Bridge (GSB) framework, the authors propose two objectives: GSB-PPO-Clip and GSB-PPO-Penalty. Lifting proximal updates from terminal actions to full generation trajectories yields a principled connection between generative policies and path-space regularization. Empirically, GSB-PPO-Penalty consistently outperforms its clipping-based counterpart in both training stability and final performance, supporting path-space proximal regularization as an effective training principle. Overall, the approach offers a unified paradigm for on-policy optimization of generative policies, bridging reinforcement learning with trajectory-level generative modeling.
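
The paper's exact loss functions are not reproduced on this page, so the following is only a minimal sketch of what a clipping-based versus penalty-based path-space surrogate could look like. The key difference from standard PPO is that `log_ratio` is a trajectory-level quantity (new minus old log-likelihood of the whole generation path), not a per-action one. The function names, the squared-log-ratio penalty, and the default coefficients are illustrative assumptions, not the paper's formulation.

```python
import torch

def gsb_ppo_clip_loss(log_ratio, advantage, eps=0.2):
    """Clipping-based surrogate in the spirit of GSB-PPO-Clip (sketch).

    log_ratio: log pi_new(tau) - log pi_old(tau) over the full generation
               trajectory tau, shape [batch].
    advantage: advantage estimate per sampled trajectory, shape [batch].
    """
    ratio = log_ratio.exp()
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Standard PPO pessimism: take the elementwise minimum of the two terms.
    return -torch.min(unclipped, clipped).mean()


def gsb_ppo_penalty_loss(log_ratio, advantage, beta=1.0):
    """Penalty-based surrogate in the spirit of GSB-PPO-Penalty (sketch).

    Replaces the hard clip with a soft proximal penalty on the path-space
    log-ratio; the squared penalty is an assumed stand-in for the paper's
    regularizer, not its actual form.
    """
    ratio = log_ratio.exp()
    proximal_penalty = beta * log_ratio.pow(2)
    return -(ratio * advantage - proximal_penalty).mean()
```

One plausible intuition for the reported result: a soft penalty keeps gradients informative even when the trajectory-level ratio leaves the clip region, which matters in path space, where products of per-step ratios can drift away from 1 much faster than a single action-space ratio.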

📝 Abstract
On-policy reinforcement learning with generative policies is promising but remains underexplored. A central challenge is that proximal policy optimization (PPO) is traditionally formulated in terms of action-space probability ratios, whereas diffusion- and flow-based policies are more naturally represented as trajectory-level generative processes. In this work, we propose GSB-PPO, a path-space formulation of generative PPO inspired by the Generalized Schrödinger Bridge (GSB). Our framework lifts PPO-style proximal updates from terminal actions to full generation trajectories, yielding a unified view of on-policy optimization for generative policies. Within this framework, we develop two concrete objectives: a clipping-based objective, GSB-PPO-Clip, and a penalty-based objective, GSB-PPO-Penalty. Experimental results show that while both objectives are compatible with on-policy training, the penalty formulation consistently delivers better stability and performance than the clipping counterpart. Overall, our results highlight path-space proximal regularization as an effective principle for training generative policies with PPO.
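
To make the phrase "lifts PPO-style proximal updates from terminal actions to full generation trajectories" concrete: for a discretized diffusion or flow policy with Gaussian denoising steps, the path-space log-likelihood factorizes over steps, so the PPO ratio can be built from whole-chain log-probabilities instead of a single terminal action density. The sketch below assumes Gaussian per-step transitions with a shared fixed noise scale; these modeling details are illustrative assumptions, not taken from the paper.

```python
import torch

def path_log_prob(traj, step_means, sigma):
    """Log-probability of a full denoising trajectory under a policy.

    traj:       generated states x_T, ..., x_0, shape [T+1, batch, dim].
    step_means: the policy's predicted mean for each transition
                x_t -> x_{t-1}, shape [T, batch, dim].
    sigma:      fixed per-step noise scale (scalar), an assumption here.

    Returns the summed per-step Gaussian log-densities, shape [batch]:
        log pi(tau) = sum_t log N(x_{t-1}; mu_t, sigma^2 I).
    """
    x_next = traj[1:]  # the output x_{t-1} of each denoising step
    dist = torch.distributions.Normal(step_means, sigma)
    # Sum over state dimensions and over all T denoising steps.
    return dist.log_prob(x_next).sum(dim=(0, -1))

# The trajectory-level PPO quantity is then
#   log_ratio = path_log_prob(traj, means_new, sigma) \
#             - path_log_prob(traj, means_old, sigma)
# which can be fed into surrogate losses like those sketched above.
```
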
Problem

Research questions and friction points this paper is trying to address.

Proximal Policy Optimization
Generative Policies
Path Space
Schrödinger Bridge
On-policy Reinforcement Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proximal Policy Optimization
Generative Policies
Schrödinger Bridge
Path Space
On-policy Reinforcement Learning