🤖 AI Summary
Diffusion- and flow-matching-based policies suffer from slow inference and unstable training in offline reinforcement learning (RL). Method: We propose the Single-Step Completion Policy (SSCP), which augments the flow-matching objective to directly predict action-completion vectors, enabling one-shot, high-fidelity action generation. SSCP combines flow-matching modeling, an augmented sampling objective, an offline actor-critic framework, and a goal-conditioned policy design. Contributions/Results: (1) the first single-step completion mechanism, unifying the multimodal expressiveness of generative models with the efficiency of one-step policies; (2) the first extension of flow matching to goal-conditioned RL, enabling flat policies to implicitly exploit subgoal structure. Experiments show that SSCP significantly outperforms diffusion-based baselines on standard offline RL and behavior cloning benchmarks, achieves a several-fold inference speedup, and supports offline-to-online transfer and online fine-tuning, demonstrating strong generalization and adaptability.
📝 Abstract
Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the *Single-Step Completion Policy* (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and behavior cloning benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making.
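The core idea, predicting a completion vector that maps an intermediate flow sample straight to the endpoint rather than a local velocity to be integrated over many steps, can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names (`completion_loss`, `one_step_action`) and the choice of target `x1 - x_t` with a plain squared loss are hypothetical stand-ins for whatever augmented objective SSCP actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def completion_loss(actions, noise, t, predict):
    """Hypothetical single-step completion objective.

    x_t linearly interpolates between a base sample (t=0) and a dataset
    action (t=1), as in flow matching; the model is regressed onto the
    completion vector x1 - x_t, so one model call maps any intermediate
    sample directly to a full action (no iterative ODE solve).
    """
    x1 = actions                       # dataset actions (flow endpoint)
    x0 = noise                         # base samples (flow start)
    x_t = (1 - t) * x0 + t * x1        # intermediate flow sample
    target = x1 - x_t                  # completion vector to the endpoint
    pred = predict(x_t, t)
    return np.mean((pred - target) ** 2)

def one_step_action(noise, predict):
    """One-shot generation: start at t=0 and add the predicted
    completion vector in a single step."""
    t = np.zeros((noise.shape[0], 1))
    return noise + predict(noise, t)

# Toy check with an oracle predictor whose true endpoint is x1 = 0.5:
# the loss is zero and one step recovers the endpoint exactly.
x1_true = 0.5
oracle = lambda x_t, t: x1_true - x_t
noise = rng.normal(size=(4, 2))
actions = one_step_action(noise, oracle)   # all entries equal 0.5
```

In a real actor-critic setup the predictor would be a neural network conditioned on the state (and, in the goal-conditioned extension, a goal), but the one-step sampling path above is what removes the long backpropagation chain that iterative diffusion policies require.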