$\pi_{\texttt{RL}}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

πŸ“… 2025-10-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Flow-based vision-language-action (VLA) models (e.g., Ο€β‚€, Ο€β‚€.β‚…) are difficult to fine-tune with large-scale reinforcement learning (RL): exact action log-likelihoods are intractable under iterative denoising, and supervised fine-tuning relies heavily on annotated data. Method: Two approaches are proposed. Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network, enabling differentiable, exact log-likelihood computation; Flow-SDE converts the denoising ODE to an SDE, yielding an efficient stochastic dynamical framework for exploration. Integrated with distributed RL algorithms, both support multi-task online training and large-scale parallel simulation. Contribution/Results: On LIBERO, Ο€β‚€ and Ο€β‚€.β‚… reach 97.6% and 98.3% task accuracy; on ManiSkill, 85.7% and 84.8% across 4352 pick-and-place tasks, significantly surpassing supervised fine-tuning baselines. The work presents the first end-to-end, scalable, high-accuracy RL optimization grounded in flow-based generative modeling.

πŸ“ Abstract
Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., $\pi_0$, $\pi_{0.5}$) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with $\pi_{\text{RL}}$, an open-source framework for training flow-based VLAs in parallel simulation. $\pi_{\text{RL}}$ implements two RL algorithms: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate $\pi_{\text{RL}}$ on the LIBERO and ManiSkill benchmarks. On LIBERO, $\pi_{\text{RL}}$ boosts few-shot SFT models $\pi_0$ and $\pi_{0.5}$ from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train $\pi_{\text{RL}}$ in 320 parallel environments, improving $\pi_0$ from 41.6% to 85.7% and $\pi_{0.5}$ from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, $\pi_{\text{RL}}$ achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs.
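The ODE-to-SDE idea behind Flow-SDE can be sketched minimally: replacing a deterministic probability-flow Euler step with a noisy Euler–Maruyama step turns each denoising transition into a Gaussian policy whose per-step log-likelihood is tractable. This is a sketch under assumptions, not the paper's implementation; the velocity field `v`, the fixed noise scale `sigma`, and the step size are illustrative choices.

```python
import numpy as np

def ode_step(x, t, v, dt):
    # Deterministic probability-flow (ODE) Euler step of a flow model.
    return x + v(x, t) * dt

def sde_step(x, t, v, dt, sigma, rng):
    # Stochastic counterpart: inject Gaussian noise of scale sigma*sqrt(dt).
    # The transition becomes a Gaussian policy with a closed-form log-prob,
    # which is what makes RL objectives computable on the denoising chain.
    mean = x + v(x, t) * dt
    x_next = mean + sigma * np.sqrt(dt) * rng.normal(size=x.shape)
    var = (sigma ** 2) * dt
    logp = -0.5 * np.sum((x_next - mean) ** 2 / var + np.log(2 * np.pi * var))
    return x_next, logp
```

Summing `logp` over all denoising steps gives an exact log-likelihood for the sampled action chunk, while `ode_step` recovers the original deterministic sampler when no exploration noise is wanted.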
Problem

Research questions and friction points this paper is trying to address.

Addresses intractable action log-likelihoods in flow-based vision-language-action models
Enables scalable reinforcement learning for iterative denoising processes
Improves robot task performance through parallel simulation training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses online RL fine-tuning for vision-language-action models
Implements Flow-Noise with learnable network for exact likelihood
Employs Flow-SDE with ODE-to-SDE conversion for exploration
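Once per-step log-probabilities are summed into an exact action-chunk log-likelihood, a standard clipped-surrogate policy-gradient update can be applied to it. The source only mentions distributed RL algorithms generically, so the PPO-style objective below is an assumed, minimal sketch.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    # Importance ratio between the current and behavior denoising policies,
    # computed from exact action-chunk log-likelihoods.
    ratio = np.exp(logp_new - logp_old)
    # Clipped surrogate: take the pessimistic (minimum) objective, negated
    # so that gradient descent on the loss maximizes expected advantage.
    return -np.minimum(ratio * advantage,
                       np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
```

With identical old and new log-likelihoods the ratio is 1 and the loss reduces to the negated advantage; large ratios are clipped at 1 ± eps, bounding the update size.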
πŸ‘₯ Authors
Kang Chen
Peking University, Zhongguancun Academy
Zhihao Liu
Institute of Automation, Chinese Academy of Sciences, Zhongguancun Academy
Tonghe Zhang
Carnegie Mellon University, Infinigence AI
Zhen Guo
Institute of Automation, Chinese Academy of Sciences
Si Xu
Institute of Automation, Chinese Academy of Sciences
Hao Lin
Institute of Automation, Chinese Academy of Sciences
Hongzhi Zang
Tsinghua University
Quanlu Zhang
Institute of Automation, Chinese Academy of Sciences
Zhaofei Yu
Peking University
Brain-inspired Computing, Spiking Neural Networks, Computational Neuroscience
Guoliang Fan
Professor of Electrical Engineering at Oklahoma State University
Image Processing, Computer Vision, Machine Learning, Multimedia
Tiejun Huang
Professor, School of Computer Science, Peking University
Visual Information Processing
Yu Wang
Tsinghua University
Chao Yu
Tsinghua University, Zhongguancun Academy