AI Summary
This work addresses the state distribution shift problem in imitation learning, which arises from the limited coverage of expert demonstrations. It proposes a robust, adaptive imitation learning framework that combines an offline phase with an online phase. During the offline phase, the method expands policy coverage by integrating supplementary demonstrations through a discriminator. In the online phase, it detects distribution shift and uses self-supervised learning on newly collected experience to keep adapting the policy to environmental changes. This multi-stage lifelong learning strategy improves the policy's robustness and generalization. Experiments on MuJoCo benchmarks show that the approach outperforms existing baselines, with better robustness to distribution shift and stronger online adaptation.
Abstract
Distribution shift in imitation learning refers to the problem that an agent cannot choose proper actions in states it did not visit during training. This problem is largely attributable to the inherently narrow state-action coverage that expert demonstrations provide over the full environment. In this paper, we propose a robust offline-to-adaptive-online imitation learning framework that handles distribution shift in a lifelong, multi-phase scheme. In the offline learning phase, we use a discriminator to train the policy effectively on supplementary demonstrations, which broadens the policy's state-action coverage and thereby improves its robustness to distribution shift. In the subsequent online inference phase, our framework detects the occurrence of distribution shift and performs self-supervised imitation learning from online experience to adapt the policy to the online environment. Through extensive evaluations in MuJoCo environments, we demonstrate that our method is more robust to distribution shift and adapts better to online environments than the baseline algorithms.
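The abstract does not say how the discriminator is used to train on supplementary demonstrations. One plausible reading, shown here only as a minimal sketch, is to score each supplementary state-action pair by how expert-like it looks and use that score as a per-sample weight in a behavior cloning loss. Everything below is an assumption for illustration rather than the paper's method: the logistic-regression discriminator, the squared-error loss, the toy data, and the names `train_discriminator`, `expert_likeness`, and `weighted_bc_loss` are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_discriminator(expert, supplementary, steps=300, lr=0.5):
    """Fit a logistic-regression discriminator D(x) ~ P(expert | x).

    Each sample is a list of floats (a flattened state-action pair).
    Expert samples are labeled 1, supplementary samples 0.
    (Assumed stand-in for whatever discriminator the paper uses.)
    """
    data = [(x, 1.0) for x in expert] + [(x, 0.0) for x in supplementary]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Full-batch gradient descent step on the mean log loss.
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def expert_likeness(model, x):
    """Discriminator output in (0, 1); higher means more expert-like."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def weighted_bc_loss(policy, samples, weights):
    """Weighted mean squared error between policy(state) and the demonstrated action."""
    total = sum(wt * (policy(s) - a) ** 2 for (s, a), wt in zip(samples, weights))
    return total / sum(weights)


# Toy data: each sample is [state, action]. Half of the supplementary
# demonstrations resemble the expert data, the other half do not.
rng = random.Random(0)
expert = [[rng.gauss(2.0, 0.3), 1.0] for _ in range(50)]
near = [[rng.gauss(2.0, 0.3), 1.0] for _ in range(25)]
far = [[rng.gauss(-2.0, 0.3), -1.0] for _ in range(25)]

model = train_discriminator(expert, near + far)
near_weights = [expert_likeness(model, x) for x in near]
far_weights = [expert_likeness(model, x) for x in far]
```

In this sketch, supplementary samples that resemble expert data receive larger weights, so they contribute more to the cloning loss, while dissimilar samples are down-weighted instead of being discarded.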