🤖 AI Summary
This work addresses the poor generalization of visuomotor policies for embodied agents across embodiments, such as varying sensor configurations and dynamics, as well as their low sample efficiency and failure to adapt to unseen environments. To this end, the authors propose CAPO, a novel framework that integrates contrastive prompt learning with adaptive prompt orchestration. CAPO constructs a learnable prompt pool and dynamically aggregates the prompts relevant to the current observation, explicitly modeling and disentangling task-relevant features from domain-specific factors to produce robust state representations. By fusing visual, temporal action, and textual goal information through hybrid contrastive learning, CAPO enables end-to-end policy optimization. It significantly outperforms existing methods in both sample efficiency and final performance, and shows strong zero-shot transfer under drastic environmental changes, including variations in lighting, field of view, and orientation.
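The hybrid contrastive objective described above can be illustrated with a minimal, hypothetical sketch (not the authors' code): an InfoNCE-style loss pulls a visual embedding toward its paired positives from each modality (an augmented view, the temporal action embedding, the textual goal embedding) and pushes it away from negatives, with one term per modality summed into a single loss. All function names and the choice of cosine similarity with temperature 0.1 are illustrative assumptions.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    return _dot(a, b) / (math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b)))

def info_nce(anchor, positive, negatives, tau=0.1):
    # Temperature-scaled similarities: positive first, then the negatives.
    logits = [_cosine(anchor, positive) / tau]
    logits += [_cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    # Negative log of the softmax probability assigned to the positive.
    return -math.log(exps[0] / sum(exps))

def hybrid_contrastive_loss(vis, vis_aug, act, txt, negatives):
    """One contrastive term per modality pairing, summed (illustrative)."""
    return (info_nce(vis, vis_aug, negatives)   # visual <-> augmented visual
            + info_nce(vis, act, negatives)     # visual <-> temporal action
            + info_nce(vis, txt, negatives))    # visual <-> textual goal

# Toy example: aligned positives, orthogonal/opposite negatives.
vis, vis_aug = [1.0, 0.0], [0.9, 0.1]
act, txt = [1.0, 0.0], [0.8, 0.2]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss = hybrid_contrastive_loss(vis, vis_aug, act, txt, negatives)
```

A well-aligned positive yields a much smaller loss than a mismatched one, which is what drives the embeddings of the three modalities toward a shared, task-relevant representation.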
📝 Abstract
Learning adaptive visuomotor policies for embodied agents remains a formidable challenge, particularly when facing cross-embodiment variations such as diverse sensor configurations and dynamic properties. Conventional learning approaches often struggle to separate task-relevant features from domain-specific variations (e.g., lighting, field of view, and rotation), leading to poor sample efficiency and catastrophic failure in unseen environments. To bridge this gap, we propose ContrAstive Prompt Orchestration (CAPO), a novel approach for learning visuomotor policies that integrates contrastive prompt learning and adaptive prompt orchestration. For prompt learning, we devise a hybrid contrastive learning strategy that integrates visual, temporal action, and text objectives to establish a pool of learnable prompts, where each prompt induces a visual representation encapsulating fine-grained domain factors. Based on these learned prompts, we introduce an adaptive prompt orchestration mechanism that dynamically aggregates the prompts conditioned on current observations. This enables the agent to adaptively construct optimal state representations by identifying the dominant domain factors at each instant. Consequently, policy optimization is effectively shielded from irrelevant interference, preventing the common issue of overfitting to source domains. Extensive experiments demonstrate that CAPO significantly outperforms state-of-the-art baselines in sample efficiency and asymptotic performance. Crucially, it exhibits superior zero-shot adaptation across unseen target domains characterized by drastic environmental shifts (e.g., illumination) and physical shifts (e.g., field of view and rotation), validating it as a viable solution for cross-embodiment visuomotor policy adaptation.
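The orchestration mechanism described in the abstract can be sketched as observation-conditioned attention over a prompt pool: each prompt carries a key and a value vector, the current observation feature is scored against the keys, and the softmax-weighted sum of the values forms the aggregated prompt. This is a hypothetical, minimal sketch under assumed shapes and names (`DIM`, `POOL_SIZE`, `orchestrate`), not the paper's implementation; in practice the keys and values would be learned jointly with the contrastive objectives.

```python
import math
import random

random.seed(0)
DIM, POOL_SIZE = 8, 4

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Prompt pool: each prompt is a (key, value) pair of DIM-dim vectors.
# Here they are random stand-ins; in training they would be learnable.
prompt_keys = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(POOL_SIZE)]
prompt_values = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(POOL_SIZE)]

def orchestrate(obs_feature):
    """Aggregate the pool by attention between the observation and prompt keys."""
    scores = [dot(obs_feature, k) / math.sqrt(DIM) for k in prompt_keys]
    weights = softmax(scores)
    # Weighted sum of prompt values -> observation-conditioned prompt.
    aggregated = [sum(w * v[d] for w, v in zip(weights, prompt_values))
                  for d in range(DIM)]
    return aggregated, weights

obs = [random.gauss(0, 1) for _ in range(DIM)]
aggregated, weights = orchestrate(obs)
```

Because the weights are recomputed for every observation, a shift in the dominant domain factor (say, a lighting change) simply redistributes attention across the pool, which is how the mechanism can adapt the state representation without retraining the policy.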