🤖 AI Summary
This work addresses catastrophic forgetting and the stability-plasticity dilemma in continual reinforcement learning by proposing a decoupled teacher-student framework. Specifically, it transfers distributed reinforcement learning policies—each independently trained on a distinct task—into a shared student model via continual policy distillation. The approach uniquely decouples distributed RL teachers from the continual distillation process and integrates a Mixture-of-Experts (MoE) architecture with experience replay to simultaneously enhance multi-task generalization and mitigate forgetting. Evaluated on the Meta-World benchmark, the student model recovers over 85% of the teachers’ performance while keeping task-level forgetting below 10%.
📝 Abstract
Continual Reinforcement Learning (CRL) aims to develop lifelong learning agents that continuously acquire knowledge across diverse tasks while mitigating catastrophic forgetting. This requires efficiently managing the stability-plasticity dilemma and leveraging prior experience to rapidly generalize to novel tasks. While various enhancement strategies for both aspects have been proposed, achieving scalable performance by directly applying RL to sequential task streams remains challenging. In this paper, we propose a novel teacher-student framework that decouples CRL into two independent processes: training single-task teacher models through distributed RL and continually distilling them into a central generalist model. This design is motivated by the observation that RL excels at solving single tasks, while policy distillation -- a relatively stable supervised learning process -- is well aligned with large foundation models and multi-task learning. Moreover, a mixture-of-experts (MoE) architecture and a replay-based approach are employed to enhance the plasticity and stability of the continual policy distillation process. Extensive experiments on the Meta-World benchmark demonstrate that our framework enables efficient continual RL, recovering over 85% of teacher performance while constraining task-wise forgetting to within 10%.
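The continual distillation process described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `MoEPolicy` module, `distill_task` function, expert count, and replay weighting are all assumptions. It shows the two components the abstract names -- a gated mixture-of-experts student and a replay buffer of past teacher outputs -- combined in a KL-based distillation loss.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEPolicy(nn.Module):
    """Student policy: a gated mixture-of-experts head over a shared trunk."""
    def __init__(self, obs_dim, act_dim, n_experts=4, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.gate = nn.Linear(hidden, n_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(hidden, act_dim) for _ in range(n_experts)]
        )

    def forward(self, obs):
        h = self.trunk(obs)
        w = F.softmax(self.gate(h), dim=-1)                      # (B, E) gate weights
        out = torch.stack([e(h) for e in self.experts], dim=1)   # (B, E, A)
        return (w.unsqueeze(-1) * out).sum(dim=1)                # gated mixture of logits

def distill_task(student, teacher_logits_fn, task_obs, replay, opt,
                 epochs=3, replay_weight=0.5):
    """One continual-distillation step: fit the student to the current
    teacher's action distribution, mixing in replayed (obs, logits) pairs
    from earlier tasks to limit forgetting. Hyperparameters are illustrative."""
    for _ in range(epochs):
        target = teacher_logits_fn(task_obs)
        loss = F.kl_div(F.log_softmax(student(task_obs), dim=-1),
                        F.softmax(target, dim=-1), reduction="batchmean")
        if replay:  # rehearse a past task's teacher outputs
            r_obs, r_logits = random.choice(replay)
            loss = loss + replay_weight * F.kl_div(
                F.log_softmax(student(r_obs), dim=-1),
                F.softmax(r_logits, dim=-1), reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Store this task's data and (detached) teacher logits for future rehearsal.
    replay.append((task_obs, teacher_logits_fn(task_obs).detach()))
```

Note that the teacher is only queried for logits, so the distributed RL training that produced it stays fully decoupled from this supervised loop, which is the core design point of the framework.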