Balancing Signal and Variance: Adaptive Offline RL Post-Training for VLA Flow Models

📅 2025-09-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing flow-matching-based vision-language-action (VLA) models suffer from insufficient action accuracy on complex manipulation tasks, primarily because imitation-learning-only post-training fails to model the distribution of data quality. To address this, we propose Adaptive Reinforced Flow Matching (ARFM), the first framework to integrate offline reinforcement learning into the VLA flow-matching paradigm. ARFM introduces an end-to-end differentiable post-training objective and an adaptive loss-scaling mechanism that dynamically balances advantage-signal preservation against gradient-variance control. By unifying flow matching, advantage-weighted policy optimization, and adaptive scaling, ARFM significantly improves action accuracy in both simulation and real-robot experiments. It also demonstrates strong few-shot learning, continual-learning adaptability, robustness to distribution shift, and cross-task generalization, outperforming prior flow-matching and RL-based VLA approaches across multiple benchmarks.
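The core recipe described above, a flow-matching loss reweighted by an offline-RL advantage signal, can be sketched in a few lines of PyTorch. Everything below is illustrative: the function names, the linear interpolant, and the exponential weighting are assumptions in the style of advantage-weighted offline RL, not the paper's actual objective.

```python
import torch

def advantage_weighted_fm_loss(policy, obs, actions, advantages, beta=1.0):
    """Flow-matching loss with exponential advantage weights (illustrative).

    `policy(obs, x_t, t)` is assumed to predict the velocity field; the
    linear noise-to-action path and the weighting rule are sketch choices.
    """
    b = actions.shape[0]
    t = torch.rand(b, 1, device=actions.device)           # flow time ~ U(0, 1)
    noise = torch.randn_like(actions)                     # x_0 ~ N(0, I)
    x_t = (1.0 - t) * noise + t * actions                 # linear interpolant
    target_v = actions - noise                            # conditional velocity target
    pred_v = policy(obs, x_t, t)                          # predicted velocity
    per_sample = ((pred_v - target_v) ** 2).mean(dim=-1)  # per-sample flow loss
    w = torch.exp(beta * advantages)                      # offline-RL advantage weights
    w = w / w.mean()                                      # normalize: mean weight = 1
    return (w.detach() * per_sample).mean()
```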

📝 Abstract
Vision-Language-Action (VLA) models based on flow matching have shown excellent performance on general-purpose robotic manipulation tasks. However, the action accuracy of these models on complex downstream tasks is unsatisfactory. One important reason is that these models rely solely on an imitation-learning post-training paradigm, which struggles to capture the distributional properties of data quality, precisely the kind of structure that Reinforcement Learning (RL) excels at exploiting. In this paper, we theoretically derive an offline RL post-training objective for VLA flow models and, from it, an efficient and practical offline RL fine-tuning algorithm, Adaptive Reinforced Flow Matching (ARFM). By introducing an adaptively adjusted scaling factor into the VLA flow-model loss, we construct a principled bias-variance trade-off objective that optimally controls the impact of the RL signal on the flow loss. ARFM adaptively balances RL advantage preservation against flow-loss gradient variance control, yielding a more stable and efficient fine-tuning process. Extensive simulation and real-world experiments show that ARFM exhibits excellent generalization, robustness, few-shot learning, and continual-learning performance.
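One way to read the abstract's "principled bias-variance trade-off" is as a per-batch search for the largest scaling factor whose induced weights keep the flow-loss gradient variance below a cap. The variance cap and the bisection rule below are illustrative assumptions, not the paper's closed-form derivation.

```python
import torch

def adaptive_beta(advantages, max_weight_var=1.0, lo=0.0, hi=10.0, iters=30):
    """Largest beta in [lo, hi] whose normalized weights stay under the
    variance cap, found by bisection (for nondegenerate advantages, the
    weight variance typically grows monotonically with beta)."""
    def weight_var(beta):
        w = torch.exp(beta * (advantages - advantages.max()))  # stabilized exp
        w = w / w.mean()           # same normalized weights as exp(beta * A)
        return w.var().item()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weight_var(mid) <= max_weight_var:
            lo = mid   # variance acceptable: keep more RL signal
        else:
            hi = mid   # variance too high: damp the RL signal
    return lo
```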
Problem

Research questions and friction points this paper is trying to address.

Improving action accuracy in VLA flow models
Addressing limitations of imitation learning post-training
Balancing RL signal impact against flow-loss gradient variance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Reinforced Flow Matching (ARFM) algorithm
Adaptively adjusted scaling factor in the flow loss
Balances advantage preservation and gradient variance control (see the training-step sketch below)
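Putting the pieces together, an ARFM-style fine-tuning step might look like the loop below, a sketch under the same assumptions as the snippets above (`dataset`, `policy`, and `optimizer` are hypothetical placeholders, not the paper's interfaces):

```python
for obs, actions, advantages in dataset:
    beta = adaptive_beta(advantages)     # re-balance signal vs. variance per batch
    loss = advantage_weighted_fm_loss(policy, obs, actions, advantages, beta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```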
👥 Authors
Hongyin Zhang
Westlake University, Hangzhou, China
Shiyuan Zhang
University of California, Los Angeles, USA
Junxi Jin
Westlake University, Hangzhou, China
Qixin Zeng
Westlake University, Hangzhou, China
Yifan Qiao
Postdoc at University of California, Berkeley
Hongchao Lu
Westlake University, Hangzhou, China
Donglin Wang
Westlake University, Hangzhou, China