Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning

📅 2025-12-03
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
In offline reinforcement learning, conventional behavior regularization methods struggle to distinguish high-value from low-value actions in datasets, hindering policy optimization under suboptimal data. To address this, we propose the Guided Flow Policy (GFP), which establishes a bidirectional guidance mechanism: (i) a one-step executor distilled to imitate high-value actions, and (ii) a multi-step flow-matching policy that enforces distributional alignment with high-quality trajectory segments. Technically, GFP integrates flow matching, weighted behavior cloning, policy distillation, and critic-guided action selection—enabling value-aware action filtering and distributional consistency constraints. Evaluated across 144 state- and pixel-based tasks from OGBench, Minari, and D4RL, GFP consistently outperforms prior methods, especially on suboptimal datasets and complex tasks, demonstrating substantial improvements in both sample efficiency and asymptotic performance.
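The summary names the ingredients without giving the loss. As a rough illustration, below is a minimal PyTorch sketch of a value-weighted conditional flow-matching objective, where dataset actions with higher estimated advantage receive a larger behavior-cloning weight. The exponentiated-advantage weighting, the `velocity_net(states, x_t, t)` signature, and all hyperparameters are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def weighted_flow_matching_loss(velocity_net, critic, value_net, states, actions, beta=1.0):
    """Value-weighted conditional flow-matching loss (illustrative sketch).

    Interpolates linearly between Gaussian noise and dataset actions and
    regresses the constant target velocity (action - noise), weighting each
    sample by an exponentiated advantage so high-value actions dominate the
    behavior-cloning signal. The weighting scheme and network signatures are
    assumptions, not the paper's exact objective.
    """
    noise = torch.randn_like(actions)                       # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, device=actions.device)
    x_t = (1.0 - t) * noise + t * actions                   # linear probability path
    target_v = actions - noise                              # d x_t / dt along the path
    pred_v = velocity_net(states, x_t, t)                   # hypothetical velocity network

    with torch.no_grad():
        adv = critic(states, actions) - value_net(states)   # advantage estimate
        weights = torch.exp(adv / beta).clamp(max=100.0)    # up-weight high-value actions

    per_sample = ((pred_v - target_v) ** 2).mean(dim=-1)
    return (weights.squeeze(-1) * per_sample).mean()
```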

📝 Abstract
Offline reinforcement learning often relies on behavior regularization that constrains policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state- and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks. Webpage: https://simple-robotics.github.io/publications/guided-flow-policy/
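The summary also lists critic-guided action selection. One common realization is best-of-N sampling at inference time, sketched below; the `sample` method on the flow policy and the batching are hypothetical assumptions, not the paper's documented procedure.

```python
import torch

@torch.no_grad()
def select_action(flow_policy, critic, state, num_candidates=32):
    """Critic-guided action selection at inference time (illustrative sketch).

    Draws several candidate actions from the flow policy and returns the one
    with the highest Q-value. Best-of-N selection is an assumed realization
    of critic guidance; the paper's actual procedure may differ.
    """
    states = state.unsqueeze(0).expand(num_candidates, -1)  # tile state to (N, D)
    candidates = flow_policy.sample(states, num_steps=10)   # hypothetical sampling API
    q_values = critic(states, candidates).squeeze(-1)       # one Q-value per candidate
    return candidates[q_values.argmax()]
```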
Problem

Research questions and friction points this paper is trying to address.

Behavior regularization fails to distinguish high-value from low-value actions in the dataset
Indiscriminate imitation of all state-action pairs hinders policy optimization under suboptimal data
Existing methods degrade on suboptimal datasets and challenging tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-step flow-matching policy coupled with distilled actor
Weighted behavior cloning focuses on high-value actions
Mutual guidance aligns the actor with the dataset's best transitions while maximizing the critic (see the sketch below)
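To make the mutual-guidance loop concrete, the sketch below shows one plausible actor update: a distillation term pulls the one-step actor toward multi-step flow samples while a Q-term maximizes the critic. The squared-error distillation loss, the `flow_policy.sample` method, and the `alpha` trade-off are illustrative assumptions rather than the paper's exact algorithm.

```python
import torch

def actor_loss(actor, flow_policy, critic, states, alpha=1.0):
    """Loss for the one-step actor under mutual guidance (illustrative sketch).

    A distillation term pulls the actor toward samples from the multi-step
    flow policy, keeping it near the dataset's best transitions, while a
    Q-term maximizes the critic. `alpha` and the squared-error distillation
    loss are assumptions, not the paper's exact formulation.
    """
    with torch.no_grad():
        # Multi-step flow sample: integrate the learned velocity field from
        # noise to an action (hypothetical `sample` method).
        flow_actions = flow_policy.sample(states, num_steps=10)

    actor_actions = actor(states)                     # one-step action generation
    distill = ((actor_actions - flow_actions) ** 2).mean()
    q_term = -critic(states, actor_actions).mean()    # gradient ascent on Q
    return q_term + alpha * distill
```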
Franki Nguimatsia Tiofack
Inria and Département d'Informatique de l'École normale supérieure, PSL Research University, France
Théotime Le Hellard
Inria and Département d'Informatique de l'École normale supérieure, PSL Research University, France
Fabian Schramm
Inria and Département d'Informatique de l'École normale supérieure, PSL Research University, France
Nicolas Perrin-Gilbert
ISIR - CNRS UMR7222
Robotics, Bipedal locomotion, Reinforcement learning, Machine learning, Motion planning
Justin Carpentier
Research Scientist, Inria - École Normale Supérieure, Paris
Optimal Control, Simulation, Numerical Optimization, Robotics, Reinforcement Learning