🤖 AI Summary
This work addresses instability in offline reinforcement learning arising from the trade-off between reward maximization and behavioral constraints, a trade-off exacerbated by latent-policy approaches that rely on indirect distillation through a latent-space critic, causing information loss and convergence difficulties. To overcome this, we propose Latent Policy Steering (LPS), which uses a differentiable one-step MeanFlow policy to eliminate the surrogate latent critic: gradients from the Q-function in the original action space are backpropagated directly into the latent space, enabling end-to-end, high-fidelity policy optimization. The one-step policy also serves as a generative prior for behavioral constraints, effectively decoupling policy improvement from constraint enforcement. Evaluated on OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance, significantly outperforming behavioral cloning and existing latent-guided baselines, without requiring complex hyperparameter tuning.
📝 Abstract
Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out of the box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
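The core gradient path described in the abstract (original-action-space critic → differentiable one-step decoder → latent-action-space actor) can be illustrated with a deliberately tiny toy instance. This is a hedged sketch, not the authors' implementation: the linear maps `actor_W`, `decode_W` (a stand-in for the one-step MeanFlow policy), and the linear critic `q_w` are all hypothetical placeholders chosen so the chain-rule gradient can be written by hand.

```python
import numpy as np

# Toy instance of the LPS gradient path:
#   state s -> latent action z = actor(s) -> original action a = decode(z)
#   -> Q(s, a).
# The critic lives in the ORIGINAL action space; its action gradient is
# backpropagated through the (here: linear) one-step decoder into the
# latent-space actor, so no latent-space critic is needed.

rng = np.random.default_rng(0)
s_dim, z_dim, a_dim = 4, 3, 2

actor_W = rng.normal(size=(z_dim, s_dim)) * 0.1   # latent-action-space actor (hypothetical)
decode_W = rng.normal(size=(a_dim, z_dim)) * 0.1  # stand-in for the one-step MeanFlow map
q_w = rng.normal(size=a_dim)                      # toy linear critic: Q(s, a) = q_w . a

s = rng.normal(size=s_dim)                        # a fixed state sample

def q_value(W):
    z = W @ s           # latent action
    a = decode_W @ z    # decoded original-space action
    return q_w @ a      # critic evaluated on the original action

# Chain rule: dQ/dactor_W = outer(decode_W^T q_w, s).
# The critic's action-space gradient (q_w) flows through the decoder
# (decode_W^T) and lands on the actor's parameters.
grad = np.outer(decode_W.T @ q_w, s)

# One gradient-ascent step on the actor increases Q (Q is linear in actor_W
# here, so the improvement equals lr * ||grad||_F^2).
lr = 0.5
q_before = q_value(actor_W)
q_after = q_value(actor_W + lr * grad)
```

In the actual method the decoder would be a learned nonlinear one-step generative policy and the critic a neural Q-function, but the end-to-end differentiation pattern is the same: no surrogate latent critic is fit, and the latent actor is updated directly by the original-action-space value gradient.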