Score-Based One-step MeanFlow Policy Optimization

📅 2026-05-22
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two obstacles to using generative policies in online reinforcement learning: the high computational cost of the multi-step denoising used by diffusion and flow-matching policies, and MeanFlow's reliance on samples from the target distribution, which are unavailable online. The authors propose Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm that constructs the target velocity field in a fully online setting without requiring any target samples. SOM estimates the score function from the Q-function and uses a probability flow ordinary differential equation (ODE) to build that target, so sampling an action takes only a single neural network forward pass. With a single generation step, the method achieves state-of-the-art performance on locomotion tasks while substantially reducing both training and inference time compared with prior diffusion- and flow-matching-based policies.
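The quantities named in this summary can be written out as a rough sketch. The first two lines are the standard MeanFlow definitions and the last is the standard probability flow ODE for a forward SDE with drift f and diffusion g. The Boltzmann form of the target policy and the temperature \alpha are assumptions made here for illustration, since this page does not give the exact construction used by SOM.

```latex
% Average velocity over [r, t] and the one-step map it induces (MeanFlow definitions)
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau,
\qquad
z_r = z_t - (t - r)\, u(z_t, r, t).

% One-step action sampling from noise (r = 0, t = 1), conditioned on state s
a = \epsilon - u_\theta(\epsilon, 0, 1 \mid s),
\qquad \epsilon \sim \mathcal{N}(0, I).

% ASSUMPTION: Boltzmann target policy with temperature \alpha,
% which makes the score proportional to the action gradient of Q
\pi^*(a \mid s) \propto \exp\!\big(Q(s, a) / \alpha\big)
\;\Longrightarrow\;
\nabla_a \log \pi^*(a \mid s) = \frac{1}{\alpha}\, \nabla_a Q(s, a).

% Probability flow ODE associated with the SDE dx = f(x, t)\,dt + g(t)\,dw
\frac{dx}{dt} = f(x, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x).
```

Under that assumption, the score needs only the critic's gradient with respect to the action, which is why no samples from the target policy are required.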
📝 Abstract
Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly problematic in online RL. MeanFlow offers a promising alternative by learning an average velocity field that maps noise to data in a single network evaluation. However, MeanFlow typically requires samples from the target distribution to construct its target velocity field, which are unavailable in online RL. We propose Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm that resolves this by constructing the target velocity field directly from the Q-function via score estimation and a probability flow ODE, thereby concentrating probability mass on high-value modes. In the fully online RL setting, SOM achieves state-of-the-art performance on locomotion tasks with a single generation step, while substantially reducing both training and inference time compared to prior diffusion- and flow-matching-based policies.
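As a minimal numerical sketch of the one-step idea, the toy below uses a hypothetical quadratic critic and assumes a Boltzmann target policy proportional to exp(Q / alpha), which is then a Gaussian centred on the best action. In that special case the average velocity from noise to action has a closed form, so it stands in for the learned MeanFlow network. None of the function names, the critic, or the temperature `alpha` come from the paper, and this is not the SOM training procedure.

```python
import numpy as np


def q_value(action, best_action):
    """Hypothetical toy critic: Q(s, a) = -0.5 * ||a - a*(s)||^2."""
    return -0.5 * np.sum((action - best_action) ** 2, axis=-1)


def score_from_q(action, best_action, alpha):
    """Score of the ASSUMED target pi(a|s) proportional to exp(Q / alpha).

    For this target, grad_a log pi = grad_a Q / alpha. The gradient of the
    quadratic toy critic is written analytically.
    """
    grad_q = -(action - best_action)
    return grad_q / alpha


def average_velocity(noise, best_action, alpha):
    """Closed-form average velocity u(z_1, r=0, t=1) for this toy case.

    With the quadratic critic the assumed target is N(a*, alpha * I). The
    flow from standard normal noise to that Gaussian maps z_1 to
    a* + sqrt(alpha) * z_1, so u = z_1 - z_0 = (1 - sqrt(alpha)) * z_1 - a*.
    A trained MeanFlow network would replace this function.
    """
    return (1.0 - np.sqrt(alpha)) * noise - best_action


def sample_action_one_step(noise, best_action, alpha):
    """One-step MeanFlow sampling: a = z_1 - u(z_1, 0, 1)."""
    return noise - average_velocity(noise, best_action, alpha)


rng = np.random.default_rng(0)
best_action = np.array([1.0, -2.0])   # hypothetical maximiser of Q for one state
alpha = 0.25                          # hypothetical temperature
noise = rng.standard_normal((20000, 2))
actions = sample_action_one_step(noise, best_action, alpha)
print("sample mean:", actions.mean(axis=0))
print("sample std: ", actions.std(axis=0))
```

A lower `alpha` concentrates the sampled actions more tightly around the high-value mode, which is the behaviour the abstract describes. Each action still costs a single function evaluation.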
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
online RL
flow matching
diffusion policy
MeanFlow
Innovation

Methods, ideas, or system contributions that make the work stand out.

Score-based
One-step generation
MeanFlow
Online reinforcement learning
Probability flow ODE