🤖 AI Summary
This work addresses the core challenge of balancing plasticity (adapting to new tasks) and stability (retaining prior knowledge) in offline reinforcement learning when tasks arrive as a continual stream. We propose the first diffusion-based decision-making framework integrated with experience replay, enabling joint optimization of conditional sequence modeling and knowledge retention. We introduce the first large-scale continual offline RL benchmark, comprising 90 diverse, multi-domain tasks, and design a replay strategy that subsamples small portions of historical datasets, coupled with a multi-task continual evaluation protocol. Experiments demonstrate that our method significantly outperforms existing diffusion-based and classical continual learning baselines across multiple tasks, achieving an average performance gain of 12.7% and a superior plasticity–stability trade-off.
📝 Abstract
Artificial neural networks, especially recent diffusion-based models, have shown remarkable performance in gaming, control, and QA systems, where the training tasks' datasets are usually static. However, in real-world applications such as robotic control via reinforcement learning (RL), tasks change over time and new tasks arrive sequentially. This setting poses the challenge of the plasticity-stability trade-off: training an agent that can adapt to task changes while retaining acquired knowledge. In view of this, we propose a rehearsal-based continual diffusion model, called Continual Diffuser (CoD), to endow the diffuser with the capabilities of quick adaptation (plasticity) and lasting retention (stability). Specifically, we first construct an offline benchmark that contains 90 tasks from multiple domains. Then, we train CoD on each task with sequential modeling and conditional generation for decision-making. Next, we preserve a small portion of each previous dataset as a rehearsal buffer and replay it to retain acquired knowledge. Extensive experiments on a series of tasks show that CoD achieves a promising plasticity-stability trade-off and outperforms existing diffusion-based methods and other representative baselines on most tasks.
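The rehearsal idea above, keeping a small subsample of each finished task's offline dataset and mixing it into later training batches, can be sketched as follows. This is a minimal illustration, not CoD's actual implementation: the class name, the `keep_fraction` and `replay_ratio` parameters, and the flat-list representation of a dataset are all assumptions made for clarity.

```python
import random

class RehearsalBuffer:
    """Sketch of a rehearsal buffer for continual offline RL (hypothetical
    API; the paper's buffer may store trajectory segments, not items)."""

    def __init__(self, keep_fraction=0.1, seed=0):
        self.keep_fraction = keep_fraction  # portion of each past dataset kept
        self.rng = random.Random(seed)
        self.buffer = []  # subsampled data from all previous tasks

    def add_task(self, task_dataset):
        # After finishing a task, retain a small random subsample of its data.
        k = max(1, int(len(task_dataset) * self.keep_fraction))
        self.buffer.extend(self.rng.sample(task_dataset, k))

    def mixed_batch(self, current_data, batch_size, replay_ratio=0.5):
        # Blend current-task samples with replayed historical samples so the
        # model keeps seeing old tasks (stability) while fitting the new one
        # (plasticity).
        n_replay = min(len(self.buffer), int(batch_size * replay_ratio))
        batch = self.rng.sample(current_data, batch_size - n_replay)
        batch += self.rng.sample(self.buffer, n_replay)
        return batch
```

In this sketch, each training batch on a new task is drawn half from the current dataset and half from the buffer; in practice the replay ratio and subsample size control the plasticity-stability balance.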