🤖 AI Summary
This work addresses the challenge of deploying flow-based vision-language-action (VLA) models in online reinforcement learning, where intractable likelihoods during multi-step sampling hinder effective training. The authors propose π-StepNFT, a framework that enables likelihood-free, value-network-free online training of flow-based VLA policies. By performing policy updates via a single forward pass per optimization step and incorporating a step-wise negative-aware fine-tuning mechanism, the method achieves fine-grained policy alignment across broad action spaces. Evaluated on the LIBERO benchmark, π-StepNFT demonstrates strong few-shot robustness, and it significantly outperforms value-based baselines in out-of-distribution scenarios on ManiSkill by preventing overfitting to multimodal features.
📝 Abstract
Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, hindering online reinforcement learning. We propose **π-StepNFT** (Step-wise Negative-aware Fine-Tuning), a critic-free and likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, π-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features. These properties make it a scalable solution promising for complex real-world applications.
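To make the core idea concrete, below is a minimal, hypothetical sketch of a step-wise negative-aware objective for a flow-matching policy. It is not the paper's implementation: the function names, the use of a simple advantage sign as the positive/negative signal, and the plain squared-error flow-matching loss are all illustrative assumptions. The point it shows is that each denoising step contributes its own loss term (step-wise guidance), weighted toward positively rewarded samples and away from negative ones, with no likelihood computation and no value network.

```python
import numpy as np


def flow_matching_loss(v_pred, v_target):
    # Per-sample mean squared error between predicted and target velocities
    # at one denoising step; shape (batch,).
    return np.mean((v_pred - v_target) ** 2, axis=-1)


def stepwise_negative_aware_loss(v_pred_steps, v_target_steps, advantages):
    """Hypothetical step-wise negative-aware objective (illustrative only).

    v_pred_steps, v_target_steps: arrays of shape (num_steps, batch, action_dim),
        the policy's predicted velocities and the flow-matching targets at each
        denoising step of the sampled action chunk.
    advantages: shape (batch,), a scalar return signal per sampled trajectory
        (assumed here; the paper may use a different positive/negative criterion).
    """
    # +1 for positively rewarded samples (pull toward), -1 for negatives (push away).
    signs = np.where(advantages > 0, 1.0, -1.0)
    # One flow-matching loss per denoising step -> shape (num_steps, batch).
    per_step = np.stack([flow_matching_loss(p, t)
                         for p, t in zip(v_pred_steps, v_target_steps)])
    # Sign-weighted average over steps and batch; a single forward pass per
    # optimization step suffices because no likelihood or critic is evaluated.
    return float(np.mean(signs * per_step))
```

In this sketch, minimizing the loss decreases the flow-matching error on positive samples and increases it on negative ones, giving per-step alignment pressure without ever evaluating the policy's (intractable) likelihood.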