🤖 AI Summary
This work addresses instability in offline reinforcement learning arising from the trade-off between reward maximization and behavioral constraints, a trade-off exacerbated by latent-policy approaches that rely on indirect distillation through a latent-space critic, causing information loss and convergence difficulties. To overcome this, we propose Latent Policy Steering (LPS), which uses a differentiable one-step MeanFlow policy to eliminate the surrogate latent critic: gradients from the Q-function in the original action space are backpropagated directly into the latent space, enabling end-to-end, high-fidelity policy optimization. The one-step policy also serves as a generative prior for behavioral constraints, effectively decoupling policy improvement from constraint enforcement. Evaluated on OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance, significantly outperforming behavioral cloning and existing latent-guided baselines, without requiring complex hyperparameter tuning.
📝 Abstract
Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out of the box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
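The core gradient path described in the abstract (original-action-space critic → differentiable one-step decoder → latent-action-space actor) can be illustrated with a deliberately tiny toy instance. This is a hedged sketch, not the authors' implementation: the linear maps `actor_W`, `decode_W` (a stand-in for the one-step MeanFlow policy), and the linear critic `q_w` are all hypothetical placeholders chosen so the chain-rule gradient can be written by hand.

```python
import numpy as np

# Toy instance of the LPS gradient path:
#   state s -> latent action z = actor(s) -> original action a = decode(z)
#   -> Q(s, a).
# The critic lives in the ORIGINAL action space; its action gradient is
# backpropagated through the (here: linear) one-step decoder into the
# latent-space actor, so no latent-space critic is needed.

rng = np.random.default_rng(0)
s_dim, z_dim, a_dim = 4, 3, 2

actor_W = rng.normal(size=(z_dim, s_dim)) * 0.1   # latent-action-space actor (hypothetical)
decode_W = rng.normal(size=(a_dim, z_dim)) * 0.1  # stand-in for the one-step MeanFlow map
q_w = rng.normal(size=a_dim)                      # toy linear critic: Q(s, a) = q_w . a

s = rng.normal(size=s_dim)                        # a fixed state sample

def q_value(W):
    z = W @ s           # latent action
    a = decode_W @ z    # decoded original-space action
    return q_w @ a      # critic evaluated on the original action

# Chain rule: dQ/dactor_W = outer(decode_W^T q_w, s).
# The critic's action-space gradient (q_w) flows through the decoder
# (decode_W^T) and lands on the actor's parameters.
grad = np.outer(decode_W.T @ q_w, s)

# One gradient-ascent step on the actor increases Q (Q is linear in actor_W
# here, so the improvement equals lr * ||grad||_F^2).
lr = 0.5
q_before = q_value(actor_W)
q_after = q_value(actor_W + lr * grad)
```

In the actual method the decoder would be a learned nonlinear one-step generative policy and the critic a neural Q-function, but the end-to-end differentiation pattern is the same: no surrogate latent critic is fit, and the latent actor is updated directly by the original-action-space value gradient.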