$\pi_{\texttt{RL}}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

πŸ“… 2025-10-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Flow-based vision-language-action (VLA) models (e.g., Ο€β‚€, Ο€β‚€.β‚…) are difficult to fine-tune with large-scale reinforcement learning (RL): exact action log-likelihoods are intractable under iterative denoising, and supervised fine-tuning relies heavily on annotated data. Method: Two approaches are proposed. Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network, enabling differentiable, exact log-likelihood computation; Flow-SDE converts the denoising ODE to an SDE, yielding an efficient stochastic dynamical framework for exploration. Integrated with distributed RL algorithms, both support multi-task online training and large-scale parallel simulation. Contribution/Results: On LIBERO, Ο€β‚€ and Ο€β‚€.β‚… reach 97.6% and 98.3% task accuracy; on ManiSkill, 85.7% and 84.8% across 4352 pick-and-place tasks, significantly surpassing supervised fine-tuning baselines. The work presents the first end-to-end, scalable, high-accuracy RL optimization grounded in flow-based generative modeling.

πŸ“ Abstract
Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., $\pi_0$, $\pi_{0.5}$) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with $\pi_{\text{RL}}$, an open-source framework for training flow-based VLAs in parallel simulation. $\pi_{\text{RL}}$ implements two RL algorithms: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate $\pi_{\text{RL}}$ on the LIBERO and ManiSkill benchmarks. On LIBERO, $\pi_{\text{RL}}$ boosts few-shot SFT models $\pi_0$ and $\pi_{0.5}$ from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train $\pi_{\text{RL}}$ in 320 parallel environments, improving $\pi_0$ from 41.6% to 85.7% and $\pi_{0.5}$ from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, $\pi_{\text{RL}}$ achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs.
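The ODE-to-SDE idea behind Flow-SDE can be sketched minimally: replacing a deterministic probability-flow Euler step with a noisy Euler–Maruyama step turns each denoising transition into a Gaussian policy whose per-step log-likelihood is tractable. This is a sketch under assumptions, not the paper's implementation; the velocity field `v`, the fixed noise scale `sigma`, and the step size are illustrative choices.

```python
import numpy as np

def ode_step(x, t, v, dt):
    # Deterministic probability-flow (ODE) Euler step of a flow model.
    return x + v(x, t) * dt

def sde_step(x, t, v, dt, sigma, rng):
    # Stochastic counterpart: inject Gaussian noise of scale sigma*sqrt(dt).
    # The transition becomes a Gaussian policy with a closed-form log-prob,
    # which is what makes RL objectives computable on the denoising chain.
    mean = x + v(x, t) * dt
    x_next = mean + sigma * np.sqrt(dt) * rng.normal(size=x.shape)
    var = (sigma ** 2) * dt
    logp = -0.5 * np.sum((x_next - mean) ** 2 / var + np.log(2 * np.pi * var))
    return x_next, logp
```

Summing `logp` over all denoising steps gives an exact log-likelihood for the sampled action chunk, while `ode_step` recovers the original deterministic sampler when no exploration noise is wanted.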
Problem

Research questions and friction points this paper is trying to address.

Addresses intractable action log-likelihoods in flow-based vision-language-action models
Enables scalable reinforcement learning for iterative denoising processes
Improves robot task performance through parallel simulation training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses online RL fine-tuning for vision-language-action models
Implements Flow-Noise with learnable network for exact likelihood
Employs Flow-SDE with ODE-to-SDE conversion for exploration
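Once per-step log-probabilities are summed into an exact action-chunk log-likelihood, a standard clipped-surrogate policy-gradient update can be applied to it. The source only mentions distributed RL algorithms generically, so the PPO-style objective below is an assumed, minimal sketch.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    # Importance ratio between the current and behavior denoising policies,
    # computed from exact action-chunk log-likelihoods.
    ratio = np.exp(logp_new - logp_old)
    # Clipped surrogate: take the pessimistic (minimum) objective, negated
    # so that gradient descent on the loss maximizes expected advantage.
    return -np.minimum(ratio * advantage,
                       np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
```

With identical old and new log-likelihoods the ratio is 1 and the loss reduces to the negated advantage; large ratios are clipped at 1 ± eps, bounding the update size.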
πŸ‘₯ Authors
Kang Chen
Peking University, Zhongguancun Academy
Zhihao Liu
Institute of Automation, Chinese Academy of Sciences, Zhongguancun Academy
Tonghe Zhang
Carnegie Mellon University, Infinigence AI
Zhen Guo
Institute of Automation, Chinese Academy of Sciences
Si Xu
Institute of Automation, Chinese Academy of Sciences
Hao Lin
Institute of Automation, Chinese Academy of Sciences
Hongzhi Zang
Tsinghua University
Quanlu Zhang
Institute of Automation, Chinese Academy of Sciences
Zhaofei Yu
Peking University
Brain-inspired Computing, Spiking Neural Networks, Computational Neuroscience
Guoliang Fan
Professor of Electrical Engineering at Oklahoma State University
Image Processing, Computer Vision, Machine Learning, Multimedia
Tiejun Huang
Professor, School of Computer Science, Peking University
Visual Information Processing
Yu Wang
Tsinghua University
Chao Yu
Tsinghua University, Zhongguancun Academy