🤖 AI Summary
Existing flow-matching-based vision-language-action (VLA) models suffer from insufficient action accuracy on complex manipulation tasks, primarily because imitation-learning-only post-training fails to model the quality distribution of the data. To address this, we propose Adaptive Reinforced Flow Matching (ARFM), the first framework to integrate offline reinforcement learning into the VLA flow-matching paradigm. ARFM introduces an end-to-end differentiable post-training objective and an adaptive loss-scaling mechanism that dynamically balances advantage-signal preservation against gradient-variance control. By unifying flow matching, advantage-weighted policy optimization, and adaptive scaling, ARFM significantly improves action accuracy in both simulation and real-robot experiments. Moreover, it demonstrates strong few-shot learning, continual-learning adaptability, robustness to distribution shifts, and cross-task generalization, outperforming prior flow-matching and RL-based VLA approaches across multiple benchmarks.
📝 Abstract
Vision-Language-Action (VLA) models based on flow matching have shown excellent performance in general-purpose robotic manipulation tasks. However, their action accuracy on complex downstream tasks is unsatisfactory. One important reason is that these models rely solely on the imitation-learning post-training paradigm, making it difficult to capture the distribution properties of data quality, which is exactly what Reinforcement Learning (RL) excels at. In this paper, we theoretically derive an offline RL post-training objective for VLA flow models and from it obtain an efficient and practical offline RL fine-tuning algorithm, Adaptive Reinforced Flow Matching (ARFM). By introducing an adaptively adjusted scaling factor into the VLA flow-model loss, we construct a principled bias-variance trade-off objective that optimally controls the impact of the RL signal on the flow loss. ARFM adaptively balances RL advantage preservation against flow-loss gradient-variance control, yielding a more stable and efficient fine-tuning process. Extensive simulation and real-world experiments show that ARFM exhibits excellent generalization, robustness, few-shot learning, and continual-learning performance.
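The core mechanism described above can be illustrated with a short sketch. This is not the paper's implementation; it is a minimal, hedged example assuming AWR-style exponential advantage weights, a simple mean-normalization as the "adaptive scaling" step, and per-sample velocity-regression losses as the flow-matching objective (the function name, temperature `beta`, and clipping threshold are all illustrative choices):

```python
import torch

def arfm_flow_loss(v_pred, v_target, advantages, beta=1.0, max_weight=20.0):
    """Advantage-weighted flow-matching loss with an adaptive scaling factor.

    v_pred, v_target: (B, D) predicted / target velocity fields
    advantages:       (B,)   offline RL advantage estimates (assumed given)
    """
    # Per-sample flow-matching (velocity regression) loss
    per_sample = ((v_pred - v_target) ** 2).mean(dim=-1)  # (B,)

    # Exponential advantage weights (AWR-style), clipped for stability
    w = torch.exp(beta * advantages).clamp(max=max_weight)

    # Adaptive scaling: renormalize so the batch-mean weight is 1. This
    # preserves the *relative* advantage signal while keeping the overall
    # gradient magnitude (and hence its variance) comparable to plain
    # flow matching -- a crude stand-in for the paper's principled
    # bias-variance trade-off.
    w = w / (w.mean() + 1e-8)

    # Weights are treated as constants so gradients flow only through v_pred
    return (w.detach() * per_sample).mean()
```

With all advantages equal the weights collapse to 1 and the objective reduces to the standard flow-matching MSE, which is the limiting behavior one would expect from the trade-off described in the abstract.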