GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies

📅 2025-12-02
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
In reinforcement learning, Gaussian policies are differentiable and easy to optimize but lack expressive power, whereas generative policies—such as diffusion or flow-matching models—enable multimodal action modeling yet suffer from intractable likelihoods and noisy gradient propagation, leading to instability in online training. To address this, we propose an “optimization-generation decoupling” framework: a differentiable implicit policy enables efficient policy optimization, while a conditional diffusion decoder enriches action generation. We introduce a novel two-timescale update mechanism that ensures training stability without requiring explicit action likelihood computation. Our approach is algorithm-agnostic and integrates seamlessly with various RL algorithms. Empirically, it substantially outperforms both Gaussian policies and state-of-the-art generative baselines on continuous control benchmarks—e.g., achieving a normalized return of over 870 on HopperStand, more than triple the best prior baseline. This work is the first to simultaneously achieve high expressivity and training robustness in online RL.

📝 Abstract
Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.
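The abstract describes the decoupling only at a high level; the following is a minimal PyTorch sketch of that structure, written under our own assumptions rather than taken from the paper's code. A plain Gaussian policy acts in a latent space, where likelihoods and reparameterized gradients are tractable, and a conditional decoder maps (state, latent) to an action. All names here (LatentPolicy, ActionDecoder, the dimensions) are illustrative, and the one-shot MLP decoder is a stand-in for the paper's conditional diffusion decoder.

```python
# Illustrative sketch of the optimization-generation split (not the paper's code).
# The latent policy is a plain Gaussian, so log-probs and reparameterized
# gradients are tractable; expressiveness is delegated to the decoder.
import torch
import torch.nn as nn


class LatentPolicy(nn.Module):
    """Tractable Gaussian policy over a latent variable z, conditioned on state."""

    def __init__(self, state_dim: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-std of z
        )

    def forward(self, state: torch.Tensor):
        mean, log_std = self.net(state).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
        z = dist.rsample()  # reparameterized sample: smooth, low-variance gradients
        return z, dist.log_prob(z).sum(dim=-1)


class ActionDecoder(nn.Module):
    """Conditional decoder g(s, z) -> a; stands in for the paper's diffusion decoder."""

    def __init__(self, state_dim: int, latent_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, z], dim=-1))


# Acting: sample in the tractable latent space, then decode to an action.
policy = LatentPolicy(state_dim=11, latent_dim=4)
decoder = ActionDecoder(state_dim=11, latent_dim=4, action_dim=3)
state = torch.randn(1, 11)
z, log_prob_z = policy(state)
action = decoder(state, z)  # no action likelihood is computed anywhere
```

Because the policy's log-probabilities live entirely in the latent space, no action likelihood is ever needed, which is the property the abstract highlights.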
Problem

Research questions and friction points this paper is trying to address.

Addresses instability of generative policies in online RL
Decouples optimization from generation for stable learning
Enhances expressiveness while maintaining tractable latent policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples optimization from generation for stability
Uses tractable latent policy with generative decoder
Two-timescale updates enhance expressiveness without explicit action likelihoods (see the sketch below)
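The two-timescale mechanism is named but not specified in the summary above, so the loop below is a hedged guess at its shape, reusing LatentPolicy and ActionDecoder from the earlier sketch: the latent policy takes frequent steps at a faster learning rate against a latent-space critic, while the decoder is refreshed less often and more slowly. The critic, the SAC-style policy loss, decoder_every, and the decoder's regression target are all assumed stand-ins, not the paper's losses.

```python
# Hypothetical two-timescale schedule (assumed, not from the paper).
# Reuses `policy` and `decoder` from the sketch above; critic learning omitted.
critic = nn.Sequential(          # Q(s, z) is defined on the latent action, so
    nn.Linear(11 + 4, 256),      # policy optimization never backpropagates
    nn.ReLU(), nn.Linear(256, 1) # through the generative decoder
)

policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)    # fast timescale
decoder_opt = torch.optim.Adam(decoder.parameters(), lr=3e-5)  # slow timescale
decoder_every = 10  # the decoder is also updated less frequently

for step in range(1000):
    state = torch.randn(64, 11)  # stand-in for a replay-buffer batch

    # Fast update: entropy-regularized policy step in the tractable latent space.
    z, log_prob_z = policy(state)
    q = critic(torch.cat([state, z], dim=-1)).squeeze(-1)
    policy_loss = (0.2 * log_prob_z - q).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Slow update: occasionally refresh the decoder so expressiveness grows
    # without yanking the mapping the latent policy is optimized against.
    if step % decoder_every == 0:
        target_action = torch.tanh(torch.randn(64, 3))  # stand-in regression target
        decoder_loss = ((decoder(state, z.detach()) - target_action) ** 2).mean()
        decoder_opt.zero_grad()
        decoder_loss.backward()
        decoder_opt.step()
```

Defining the critic on (state, latent) rather than (state, action) is one plausible way to realize the stated decoupling, since the noisy gradients of a deep sampling chain then never enter the policy update; the paper may realize it differently.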
👥 Authors

Chubin Zhang · Tsinghua University · Embodied AI, 3D Vision
Zhenglin Wan · Nanyang Technological University, Singapore
Feng Chen · Nanyang Technological University, Singapore
Xingrui Yu · Scientist, CFAR, A*STAR · Machine Learning, Robust Imitation Learning, Trustworthy AI
Ivor Tsang · Centre for Frontier AI Research, A*STAR, Singapore
Bo An · Nanyang Technological University, Singapore