Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Behavior cloning with flow/diffusion policies suffers from distributional shift, and these models exhibit instability during standard reinforcement-learning-based online fine-tuning. To address this, we propose a stepwise flow policy framework that discretizes continuous flow matching into multi-step Wasserstein gradient updates grounded in the Jordan–Kinderlehrer–Otto (JKO) principle. We introduce a Wasserstein trust-region constraint to ensure optimization stability, and we integrate entropy regularization, variational JKO optimization, and cascaded small-flow module training, enabling low-overhead, provably convergent policy adaptation. Experiments across diverse robotic control tasks demonstrate substantial improvements in online adaptation performance, together with high training efficiency, a minimal memory footprint, and strong convergence stability.
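
In optimal-transport notation, the stepwise update described above corresponds to the classical entropy-regularized JKO scheme. A minimal sketch, with generic step size τ and entropy weight λ (symbols assumed here for illustration, not taken from the paper):

```latex
% Entropy-regularized JKO step: each small flow increment moves the
% policy distribution toward lower objective J(\pi), while a Wasserstein
% proximity term keeps it near the previous iterate (the trust region).
\pi_{k+1} \;=\; \operatorname*{arg\,min}_{\pi}\;
    J(\pi) \;-\; \lambda\,\mathcal{H}(\pi)
    \;+\; \frac{1}{2\tau}\, W_2^2(\pi,\, \pi_k)
```

Here J is the policy objective to be minimized (e.g., negative expected return), H the entropy, and W_2 the 2-Wasserstein distance; the step size τ bounds how far each update may move, which is what the summary calls the Wasserstein trust-region constraint.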

📝 Abstract
While behavior cloning with flow/diffusion policies excels at learning complex skills from demonstrations, it remains vulnerable to distributional shift, and standard RL methods struggle to fine-tune these models due to their iterative inference process and the limitations of existing workarounds. In this work, we introduce the Stepwise Flow Policy (SWFP) framework, founded on the key insight that discretizing the flow matching inference process via a fixed-step Euler scheme inherently aligns it with the variational Jordan-Kinderlehrer-Otto (JKO) principle from optimal transport. SWFP decomposes the global flow into a sequence of small, incremental transformations between proximate distributions. Each step corresponds to a JKO update, regularizing policy changes to stay near the previous iterate and ensuring stable online adaptation with entropic regularization. This decomposition yields an efficient algorithm that fine-tunes pre-trained flows via a cascade of small flow blocks, offering significant advantages: simpler/faster training of sub-models, reduced computational/memory costs, and provable stability grounded in Wasserstein trust regions. Comprehensive experiments demonstrate SWFP's enhanced stability, efficiency, and superior adaptation performance across diverse robotic control benchmarks.
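
As a concrete illustration of the fixed-step Euler scheme the abstract builds on, below is a minimal, hypothetical sketch of flow-matching inference. This is generic Euler integration of a learned velocity field; `velocity_net` and all other names are assumptions, not the paper's code:

```python
import torch

def euler_flow_inference(velocity_net, noise, num_steps=10):
    """Fixed-step Euler integration of a flow-matching policy.

    Each Euler step applies a small, incremental transformation to the
    sample distribution; SWFP's key insight is that these increments
    can be read as JKO-style updates between proximate distributions.
    """
    x = noise                       # sample from the base distribution
    dt = 1.0 / num_steps            # fixed Euler step size
    for k in range(num_steps):
        t = torch.full((x.shape[0], 1), k * dt)  # current flow time in [0, 1)
        x = x + dt * velocity_net(x, t)          # one small transport increment
    return x                        # approximate sample from the target policy
```

Reading each of the `num_steps` increments as a transport map between nearby distributions is what lets SWFP attach a JKO interpretation, and hence a trust region, to every step of inference.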
Problem

Research questions and friction points this paper is trying to address.

Addresses distributional shift in flow policy fine-tuning
Overcomes limitations of standard RL for iterative inference models
Enables stable online adaptation of pre-trained flow policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stepwise Flow Policy discretizes flow matching inference
Decomposes global flow into incremental JKO transformations
Fine-tunes pre-trained flows via cascade of small blocks
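
To make the cascade idea concrete, here is a minimal PyTorch sketch of appending one small flow block to a frozen pre-trained policy and training it under a trust-region penalty. Everything in it (`SmallFlowBlock`, `q_value_fn`, the squared-distance surrogate for W_2, the weight `beta`) is a hypothetical illustration under stated assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SmallFlowBlock(nn.Module):
    """One small residual velocity correction appended to the cascade."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x, t):
        # Condition the correction on both the action sample and flow time.
        return self.net(torch.cat([x, t], dim=-1))

def trust_region_penalty(actions_new, actions_old):
    """Crude surrogate for the Wasserstein trust region: with paired
    samples, the mean squared displacement upper-bounds W_2^2. This is
    an illustrative stand-in, not the paper's exact constraint."""
    return ((actions_new - actions_old) ** 2).sum(dim=-1).mean()

def finetune_step(block, optimizer, actions_old, t, q_value_fn, beta=10.0):
    """Hypothetical fine-tuning step: the pre-trained flow is frozen
    (its outputs arrive as `actions_old`), and only the small block is
    trained to raise the critic's value while staying close to the
    previous iterate; `beta` weights the trust region."""
    actions_new = actions_old + block(actions_old, t)   # incremental map
    loss = -q_value_fn(actions_new).mean() \
           + beta * trust_region_penalty(actions_new, actions_old)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In SWFP terms, each such block is small and cheap to train in isolation, and stacking several of them yields the cascade named in the last bullet above, with the penalty term keeping every refinement inside its Wasserstein trust region.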
👥 Authors
Mingyang Sun (Zhejiang University, Westlake University, Shanghai Innovation Institute)
Pengxiang Ding (Zhejiang University): Human Motion Prediction, Large Language Model, Embodied AI
Weinan Zhang (Zhejiang University, Shanghai Jiao Tong University)
Donglin Wang (Westlake University)