🤖 AI Summary
This work addresses the challenge of balancing stability and plasticity in embodied agents performing continual learning in open-world environments. To this end, the authors propose a continual reinforcement learning framework tailored for vision–language–action models. The approach employs an asymmetric regulation mechanism: it constrains the magnitude of the goal-conditioned advantage on previously learned tasks to mitigate catastrophic forgetting, while permitting controlled advantage growth on new tasks to facilitate adaptation. This is realized through a dual-critic architecture built on a novel Goal-Conditioned Value Formulation (GCVF) and paired with a policy divergence constraint, which together preserve semantic consistency while retaining learning flexibility. Evaluated on the LIBERO benchmark, the method demonstrates significant improvements over existing approaches in both resistance to forgetting and forward transfer capability.
📝 Abstract
Lifelong learning is critical for embodied agents in open-world environments, where reinforcement learning fine-tuning has emerged as an important paradigm for enabling Vision-Language-Action (VLA) models to master dexterous manipulation through environmental interaction. Continual Reinforcement Learning (CRL) is therefore a promising pathway for deploying VLA models in lifelong robotic scenarios, yet balancing stability (retaining old skills) and plasticity (learning new ones) remains a formidable challenge for existing methods. We introduce CRL-VLA, a framework for continual post-training of VLA models with rigorous theoretical bounds. We derive a unified performance bound linking the stability-plasticity trade-off to goal-conditioned advantage magnitude, scaled by policy divergence. CRL-VLA resolves this dilemma via asymmetric regulation: constraining advantage magnitudes on prior tasks while enabling controlled growth on new tasks. This is realized through a simple yet effective dual-critic architecture with a novel Goal-Conditioned Value Formulation (GCVF), where a frozen critic anchors semantic consistency and a trainable estimator drives adaptation. Experiments on the LIBERO benchmark demonstrate that CRL-VLA effectively harmonizes these conflicting objectives, outperforming baselines in both anti-forgetting and forward adaptation.
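The asymmetric-regulation idea above can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the blending rule, the `anchor_weight`, and the clipping thresholds `eps_old`/`eps_new` are all assumptions introduced here to make the stability-plasticity asymmetry concrete.

```python
# Illustrative sketch of asymmetric advantage regulation with a dual critic.
# All names, thresholds, and the frozen/trainable blending rule are
# assumptions for exposition, not the paper's actual formulation.

def dual_critic_value(v_frozen: float, v_trainable: float,
                      is_old_task: bool, anchor_weight: float = 0.7) -> float:
    """Blend a frozen critic (semantic anchor) with a trainable critic.

    On old tasks the frozen critic dominates, preserving prior value
    semantics; on new tasks the trainable critic drives adaptation.
    """
    w = anchor_weight if is_old_task else 1.0 - anchor_weight
    return w * v_frozen + (1.0 - w) * v_trainable


def regulated_advantage(ret: float, v_frozen: float, v_trainable: float,
                        is_old_task: bool,
                        eps_old: float = 0.5, eps_new: float = 2.0) -> float:
    """Goal-conditioned advantage A = return - V, with asymmetric limits:
    a tight bound eps_old on prior tasks (anti-forgetting) and a looser
    bound eps_new on new tasks (plasticity)."""
    adv = ret - dual_critic_value(v_frozen, v_trainable, is_old_task)
    bound = eps_old if is_old_task else eps_new
    return max(-bound, min(bound, adv))


# The same return and critic estimates yield a small, clipped update on an
# old task but a much larger permitted update on a new task:
old_adv = regulated_advantage(3.0, 1.0, 0.0, is_old_task=True)   # clipped to 0.5
new_adv = regulated_advantage(3.0, 1.0, 0.0, is_old_task=False)  # clipped to 2.0
```

In an actual CRL loop these regulated advantages would feed a policy-gradient objective alongside the policy divergence constraint; the sketch only shows how the asymmetry biases updates toward retention on old tasks and adaptation on new ones.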