🤖 AI Summary
Reinforcement learning agents exhibit poor generalization under visual or task variations, necessitating costly retraining and hindering policy reuse. To address this, we propose a zero-shot cross-agent representation mapping method that estimates affine or orthogonal transformations between latent spaces using semantically aligned anchor points, enabling fine-tuning-free policy stitching. Our approach is the first to support modular, compositional zero-shot policy recombination—bypassing conventional transfer learning’s reliance on target-domain data or parameter adaptation. Evaluated in the CarRacing environment under concurrent background and task shifts, our method achieves high-performance zero-shot policy composition: average return retains over 95% of the original policy’s performance. This substantially enhances policy robustness and reusability in dynamic environments.
📝 Abstract
Deep Reinforcement Learning (RL) models often fail to generalize when even small changes occur in the environment's observations or task requirements. Addressing these shifts typically requires costly retraining, limiting the reusability of learned policies. In this paper, we build on recent work in semantic alignment to propose a zero-shot method for mapping between the latent spaces of agents trained on different visual and task variations. Specifically, we learn a transformation that maps embeddings from one agent's encoder into the latent space of another agent's encoder without further fine-tuning. Our approach relies on a small set of "anchor" observations that are semantically aligned, which we use to estimate an affine or orthogonal transform. Once the transformation is found, an existing controller trained for one domain can interpret embeddings from a different (existing) encoder in a zero-shot fashion, with no additional training. We empirically demonstrate zero-shot stitching on the CarRacing environment with changing background and task, and show that our framework preserves high performance under these visual and task domain shifts. By allowing modular re-assembly of existing policies, our framework paves the way for more robust, compositional RL in dynamically changing environments.
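To make the anchor-based estimation concrete, the following is a minimal NumPy sketch of how an orthogonal or affine map between two latent spaces could be fitted from paired anchor embeddings. The abstract does not specify the estimator, so orthogonal Procrustes (via SVD) and ordinary least squares are assumptions here, as are the function names `fit_orthogonal` and `fit_affine`, the row-per-anchor array layout, and the synthetic data standing in for real encoder outputs.

```python
import numpy as np


def fit_orthogonal(src, dst):
    """Orthogonal Procrustes: R minimizing ||src @ R - dst||_F with R.T @ R = I.

    src, dst: (n_anchors, dim) arrays of paired anchor embeddings (assumed layout).
    """
    u, _, vt = np.linalg.svd(src.T @ dst)
    return u @ vt


def fit_affine(src, dst):
    """Least-squares affine map (W, b) such that dst is approximately src @ W + b."""
    ones = np.ones((src.shape[0], 1))
    # Append a constant column so the bias is estimated jointly with W.
    sol, *_ = np.linalg.lstsq(np.hstack([src, ones]), dst, rcond=None)
    return sol[:-1], sol[-1]


# Synthetic stand-ins for anchor embeddings from two encoders (hypothetical data).
rng = np.random.default_rng(0)
anchors_a = rng.normal(size=(64, 8))              # encoder A on the anchor observations
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
anchors_b = anchors_a @ rotation                  # encoder B on the same anchors

R = fit_orthogonal(anchors_a, anchors_b)
W, b = fit_affine(anchors_a, anchors_b)

# Zero-shot stitching: map a new embedding from A's space into B's space,
# where a controller trained with encoder B could consume it unchanged.
z_a = rng.normal(size=(1, 8))
z_b_orthogonal = z_a @ R
z_b_affine = z_a @ W + b
```

The orthogonal variant preserves norms and angles and has fewer degrees of freedom, which may make it more stable when only a few anchors are available; the affine variant can additionally absorb scaling and offset differences between the two spaces.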