🤖 AI Summary
To address low sample efficiency of reinforcement learning in multi-task and continual learning settings, this paper proposes an energy-based adaptive policy transfer method. Our approach dynamically gates teacher policy intervention by linking energy scores with state visitation density theory, enabling out-of-distribution detection: the teacher guides exploration only in states it has encountered, thereby preventing cross-task bias. This mechanism mitigates exploration bias arising from mismatched teacher knowledge transfer. Evaluated on both single-task and multi-task benchmarks, our method achieves an average performance gain of 23% and accelerates convergence by 1.8×, significantly improving sample efficiency and generalization robustness.
📝 Abstract
Reinforcement learning algorithms often suffer from poor sample efficiency, making them challenging to apply in multi-task or continual learning settings. Efficiency can be improved by transferring knowledge from a previously trained teacher policy to guide exploration in new but related tasks. However, if the new task sufficiently differs from the teacher's training task, the transferred guidance may be sub-optimal and bias exploration toward low-reward behaviors. We propose an energy-based transfer learning method that uses out-of-distribution detection to selectively issue guidance, enabling the teacher to intervene only in states within its training distribution. We theoretically show that energy scores reflect the teacher's state-visitation density and empirically demonstrate improved sample efficiency and performance across both single-task and multi-task settings.