🤖 AI Summary
Existing flow-matching-based vision-language-action (VLA) models suffer from insufficient action accuracy on complex manipulation tasks, primarily because imitation-learning-only post-training fails to model the quality distribution of the data. To address this, we propose Adaptive Reinforced Flow Matching (ARFM), the first framework to integrate offline reinforcement learning into the VLA flow-matching paradigm. ARFM introduces an end-to-end differentiable post-training objective and an adaptive loss-scaling mechanism that dynamically balances advantage-signal preservation against gradient-variance control. By unifying flow matching, advantage-weighted policy optimization, and adaptive scaling, ARFM significantly improves action accuracy in both simulation and real-robot experiments. Moreover, it demonstrates strong few-shot learning, continual-learning adaptability, robustness to distribution shifts, and cross-task generalization, outperforming prior flow-matching and RL-based VLA approaches across multiple benchmarks.
📝 Abstract
Vision-Language-Action (VLA) models based on flow matching have shown excellent performance in general-purpose robotic manipulation tasks. However, their action accuracy on complex downstream tasks is unsatisfactory. One important reason is that these models rely solely on the imitation-learning post-training paradigm, making it difficult to capture the distribution properties of data quality, which is exactly what Reinforcement Learning (RL) excels at. In this paper, we theoretically derive an offline RL post-training objective for VLA flow models and from it obtain an efficient and practical offline RL fine-tuning algorithm, Adaptive Reinforced Flow Matching (ARFM). By introducing an adaptively adjusted scaling factor into the VLA flow-model loss, we construct a principled bias-variance trade-off objective that optimally controls the impact of the RL signal on the flow loss. ARFM adaptively balances RL advantage preservation against flow-loss gradient-variance control, yielding a more stable and efficient fine-tuning process. Extensive simulation and real-world experiments show that ARFM exhibits excellent generalization, robustness, few-shot learning, and continual-learning performance.
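The core mechanism described above can be illustrated with a short sketch. This is not the paper's implementation; it is a minimal, hedged example assuming AWR-style exponential advantage weights, a simple mean-normalization as the "adaptive scaling" step, and per-sample velocity-regression losses as the flow-matching objective (the function name, temperature `beta`, and clipping threshold are all illustrative choices):

```python
import torch

def arfm_flow_loss(v_pred, v_target, advantages, beta=1.0, max_weight=20.0):
    """Advantage-weighted flow-matching loss with an adaptive scaling factor.

    v_pred, v_target: (B, D) predicted / target velocity fields
    advantages:       (B,)   offline RL advantage estimates (assumed given)
    """
    # Per-sample flow-matching (velocity regression) loss
    per_sample = ((v_pred - v_target) ** 2).mean(dim=-1)  # (B,)

    # Exponential advantage weights (AWR-style), clipped for stability
    w = torch.exp(beta * advantages).clamp(max=max_weight)

    # Adaptive scaling: renormalize so the batch-mean weight is 1. This
    # preserves the *relative* advantage signal while keeping the overall
    # gradient magnitude (and hence its variance) comparable to plain
    # flow matching -- a crude stand-in for the paper's principled
    # bias-variance trade-off.
    w = w / (w.mean() + 1e-8)

    # Weights are treated as constants so gradients flow only through v_pred
    return (w.detach() * per_sample).mean()
```

With all advantages equal the weights collapse to 1 and the objective reduces to the standard flow-matching MSE, which is the limiting behavior one would expect from the trade-off described in the abstract.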