DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalization of existing vision-language-action (VLA) models under multi-task training and distribution shift, which the authors attribute to task-specific reinforcement learning (RL) optimizers. They propose DyGRO-VLA, a two-stage optimization framework: first, an information-theoretic objective captures latent representations shared across tasks; second, a dynamic mixture of RL residuals refines the policy. By combining cross-task representation learning with dynamic grouped residual optimization, the method aims to limit interference with the learned representations during RL optimization. Experiments on LIBERO, RoboTwin2, and real robotic platforms show consistent improvements over strong baselines under multi-task training and distribution shift.

📝 Abstract

Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture of RL residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while mitigating adverse interference with the learned representations throughout the optimization process. We evaluate our approach on the LIBERO and RoboTwin2 benchmarks and further validate it in the real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.
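
The abstract does not specify how the mixture of RL residuals is computed, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a base policy that outputs an action vector, a small set of residual corrections, and softmax gate weights that decide how much each residual contributes. All names (`softmax`, `mix_residuals`, `gate_logits`) and all numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_residuals(base_action, residuals, gate_logits):
    """Return the base action plus a gate-weighted sum of residual corrections.

    Assumption: each residual has the same dimension as base_action, and
    gate_logits holds one score per residual. How the gate is produced and
    trained in the paper is not stated in the abstract.
    """
    weights = softmax(gate_logits)
    mixed = list(base_action)
    for w, residual in zip(weights, residuals):
        for i, delta in enumerate(residual):
            mixed[i] += w * delta
    return mixed, weights


# Hypothetical values: a 3-D action and two residual corrections.
base_action = [0.10, -0.20, 0.05]
residuals = [[0.02, 0.00, -0.01], [-0.04, 0.06, 0.00]]
gate_logits = [0.0, 0.0]  # equal scores, so each residual gets weight 0.5

action, weights = mix_residuals(base_action, residuals, gate_logits)
```

Keeping the base action intact and adding only weighted corrections is one way to let RL adjust behavior without overwriting what the base policy already encodes, which matches the abstract's stated goal of limiting interference with learned representations.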

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

Reinforcement Learning

Cross-task Generalization

Task-specific Overfitting

Multi-task Learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action

Cross-task Generalization

Dynamic Grouped Residual Optimization