EXPO: Stable Reinforcement Learning with Expressive Policies

📅 2025-07-10

📈 Citations: 0

✨ Influential: 0

career value

224K/year

🤖 AI Summary

This paper addresses the challenge of unstable gradient updates during online reinforcement learning with offline data support, particularly when employing highly expressive policies such as diffusion models. To tackle this, we propose a two-stage collaborative optimization framework. First, a base policy is pre-trained via imitation learning on offline data. Second, a lightweight, differentiable Gaussian editing policy is introduced to directly optimize actions in the action space by maximizing the Q-value, using TD error for backpropagation—thereby avoiding unstable gradient updates to the value function. Our core innovation lies in decoupling representational capacity from training stability: the base policy ensures expressive power, while the editing policy enables efficient and stable value maximization. Experiments across diverse offline fine-tuning and online training settings demonstrate that our method improves sample efficiency by 2–3× on average over state-of-the-art baselines.

Technology Category

Application Category

📝 Abstract

We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL present a unique challenge of stable value maximization. Unlike simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead construct an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies -- a larger expressive base policy trained with a stable imitation learning objective and a light-weight Gaussian edit policy that edits the actions sampled from the base policy toward a higher value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods both in the setting of fine-tuning a pretrained policy given offline data and in leveraging offline data to train online.

Problem

Research questions and friction points this paper is trying to address.

Stable training of expressive policies in online RL

Challenges in gradient propagation for diffusion policies

Improving sample efficiency in offline-to-online RL

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses expressive base policy with imitation learning

Employs lightweight Gaussian edit policy

Combines base and edited actions for optimization

🔎 Similar Papers

Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL