Proximal Policy Optimization in Path Space: A Schrödinger Bridge Perspective

📅 2026-03-23
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This work extends Proximal Policy Optimization (PPO) beyond conventional action-space probability ratios to accommodate trajectory-level generative policies, such as those based on diffusion or flow models. Operating in path space and drawing on the Generalized Schrödinger Bridge (GSB) framework, the authors propose two objectives: GSB-PPO-Clip and GSB-PPO-Penalty. Lifting proximal updates from terminal actions to full generation trajectories yields a principled connection between generative policies and path-space regularization. Empirically, GSB-PPO-Penalty consistently outperforms its clipping-based counterpart in both training stability and final performance, supporting path-space proximal regularization as an effective training principle. Overall, the approach offers a unified paradigm for on-policy optimization of generative policies, bridging reinforcement learning with trajectory-level generative modeling.
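
The paper's exact loss functions are not reproduced on this page, so the following is only a minimal sketch of what a clipping-based versus penalty-based path-space surrogate could look like. The key difference from standard PPO is that `log_ratio` is a trajectory-level quantity (new minus old log-likelihood of the whole generation path), not a per-action one. The function names, the squared-log-ratio penalty, and the default coefficients are illustrative assumptions, not the paper's formulation.

```python
import torch

def gsb_ppo_clip_loss(log_ratio, advantage, eps=0.2):
    """Clipping-based surrogate in the spirit of GSB-PPO-Clip (sketch).

    log_ratio: log pi_new(tau) - log pi_old(tau) over the full generation
               trajectory tau, shape [batch].
    advantage: advantage estimate per sampled trajectory, shape [batch].
    """
    ratio = log_ratio.exp()
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Standard PPO pessimism: take the elementwise minimum of the two terms.
    return -torch.min(unclipped, clipped).mean()


def gsb_ppo_penalty_loss(log_ratio, advantage, beta=1.0):
    """Penalty-based surrogate in the spirit of GSB-PPO-Penalty (sketch).

    Replaces the hard clip with a soft proximal penalty on the path-space
    log-ratio; the squared penalty is an assumed stand-in for the paper's
    regularizer, not its actual form.
    """
    ratio = log_ratio.exp()
    proximal_penalty = beta * log_ratio.pow(2)
    return -(ratio * advantage - proximal_penalty).mean()
```

One plausible intuition for the reported result: a soft penalty keeps gradients informative even when the trajectory-level ratio leaves the clip region, which matters in path space, where products of per-step ratios can drift away from 1 much faster than a single action-space ratio.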

📝 Abstract
On-policy reinforcement learning with generative policies is promising but remains underexplored. A central challenge is that proximal policy optimization (PPO) is traditionally formulated in terms of action-space probability ratios, whereas diffusion- and flow-based policies are more naturally represented as trajectory-level generative processes. In this work, we propose GSB-PPO, a path-space formulation of generative PPO inspired by the Generalized Schrödinger Bridge (GSB). Our framework lifts PPO-style proximal updates from terminal actions to full generation trajectories, yielding a unified view of on-policy optimization for generative policies. Within this framework, we develop two concrete objectives: a clipping-based objective, GSB-PPO-Clip, and a penalty-based objective, GSB-PPO-Penalty. Experimental results show that while both objectives are compatible with on-policy training, the penalty formulation consistently delivers better stability and performance than the clipping counterpart. Overall, our results highlight path-space proximal regularization as an effective principle for training generative policies with PPO.
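
To make the phrase "lifts PPO-style proximal updates from terminal actions to full generation trajectories" concrete: for a discretized diffusion or flow policy with Gaussian denoising steps, the path-space log-likelihood factorizes over steps, so the PPO ratio can be built from whole-chain log-probabilities instead of a single terminal action density. The sketch below assumes Gaussian per-step transitions with a shared fixed noise scale; these modeling details are illustrative assumptions, not taken from the paper.

```python
import torch

def path_log_prob(traj, step_means, sigma):
    """Log-probability of a full denoising trajectory under a policy.

    traj:       generated states x_T, ..., x_0, shape [T+1, batch, dim].
    step_means: the policy's predicted mean for each transition
                x_t -> x_{t-1}, shape [T, batch, dim].
    sigma:      fixed per-step noise scale (scalar), an assumption here.

    Returns the summed per-step Gaussian log-densities, shape [batch]:
        log pi(tau) = sum_t log N(x_{t-1}; mu_t, sigma^2 I).
    """
    x_next = traj[1:]  # the output x_{t-1} of each denoising step
    dist = torch.distributions.Normal(step_means, sigma)
    # Sum over state dimensions and over all T denoising steps.
    return dist.log_prob(x_next).sum(dim=(0, -1))

# The trajectory-level PPO quantity is then
#   log_ratio = path_log_prob(traj, means_new, sigma) \
#             - path_log_prob(traj, means_old, sigma)
# which can be fed into surrogate losses like those sketched above.
```
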
Problem

Research questions and friction points this paper is trying to address.

Proximal Policy Optimization
Generative Policies
Path Space
Schrödinger Bridge
On-policy Reinforcement Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proximal Policy Optimization
Generative Policies
Schrödinger Bridge
Path Space
On-policy Reinforcement Learning