🤖 AI Summary
In offline reinforcement learning, conventional behavior regularization methods struggle to distinguish high-value from low-value actions in the dataset, hindering policy optimization under suboptimal data. To address this, we propose the Guided Flow Policy (GFP), which establishes a bidirectional guidance mechanism between (i) a one-step executor distilled to imitate high-value actions, and (ii) a multi-step flow-matching policy that enforces distributional alignment with high-quality trajectory segments. Technically, GFP integrates flow matching, weighted behavior cloning, policy distillation, and critic-guided action selection, enabling value-aware action filtering and distributional consistency constraints. Evaluated across 144 state- and pixel-based tasks from OGBench, Minari, and D4RL, GFP consistently outperforms prior methods, especially on suboptimal datasets and complex tasks, demonstrating substantial improvements in both sample efficiency and asymptotic performance.
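The weighted behavior cloning component can be sketched as a value-weighted conditional flow-matching loss. This is an illustrative reconstruction, not the paper's exact objective: the function name, the straight-line velocity target, and the weight form (e.g. exponentiated advantages) are assumptions.

```python
import numpy as np

def weighted_flow_matching_loss(v_pred, x0, x1, weights):
    """Hypothetical value-weighted flow-matching loss (illustration only).

    v_pred:  predicted velocity field at interpolated points, shape (B, D)
    x0:      noise samples, shape (B, D)
    x1:      dataset actions, shape (B, D)
    weights: per-sample value weights, e.g. exp(advantage / beta), shape (B,)
    """
    target = x1 - x0  # straight-line (rectified-flow) velocity target
    per_sample = np.mean((v_pred - target) ** 2, axis=-1)
    # High-advantage transitions get larger weights, so the flow policy
    # concentrates on cloning high-value actions rather than the whole dataset.
    return float(np.mean(weights * per_sample))
```

With uniform weights this reduces to plain behavior cloning via flow matching; the value weighting is what makes the regularizer selective.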
📝 Abstract
Offline reinforcement learning often relies on behavior regularization that constrains policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state- and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks. Webpage: https://simple-robotics.github.io/publications/guided-flow-policy/
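The one-step actor's objective described above, maximizing the critic while staying aligned with the flow policy, can be sketched as below. The loss form and the `alpha` trade-off coefficient are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def actor_loss(q_values, actor_actions, flow_actions, alpha=1.0):
    """Hypothetical one-step actor objective (sketch, assumed form):
    maximize Q(s, a) while distilling toward samples from the flow policy.

    q_values:      critic values Q(s, pi(s)), shape (B,)
    actor_actions: one-step actor outputs, shape (B, D)
    flow_actions:  actions sampled from the multi-step flow policy, shape (B, D)
    alpha:         assumed coefficient trading off critic maximization
                   against distillation to the flow policy
    """
    # Distillation term: keep the one-step actor near the flow policy's
    # samples, which are themselves regularized toward high-value data.
    distill = np.mean(np.sum((actor_actions - flow_actions) ** 2, axis=-1))
    # Negate Q so that minimizing this loss maximizes the critic.
    return float(-np.mean(q_values) + alpha * distill)
```

When the actor matches the flow policy's samples exactly, the distillation term vanishes and only critic maximization remains, which is the mutual-guidance balance the abstract describes.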