🤖 AI Summary
In robotic manipulation, policies built on vision-based representations that lack embodied priors suffer from low sample efficiency and poor generalization. To address this, we propose ICon, the first method to perform agent–environment contrastive learning at the token level within Vision Transformers, explicitly injecting body-specific inductive biases to disentangle agent-centric and environment-centric features. ICon integrates an auxiliary contrastive loss into end-to-end policy learning and requires no additional annotations or explicit modeling of robot dynamics. Experiments demonstrate that ICon significantly improves multi-task manipulation performance, achieving higher sample efficiency in both simulation and real-world settings and enabling robust policy transfer across heterogeneous robot platforms. By bridging embodied cognition and visual representation learning, ICon establishes a novel paradigm for learning grounded, agent-aware visual representations.
📝 Abstract
Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present $\textbf{I}$nter-token $\textbf{Con}$trast ($\textbf{ICon}$), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. Project website: https://github.com/HenryWJL/icon
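
To make the idea concrete, below is a minimal sketch, not the authors' implementation, of a token-level agent–environment contrastive loss over ViT patch tokens. It assumes a per-patch boolean `agent_mask` is available as a stand-in for however ICon identifies agent tokens; the function name, the `temperature` value, and the mask source are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of a token-level
# agent-environment contrastive loss over ViT patch tokens.
import torch
import torch.nn.functional as F


def inter_token_contrastive_loss(tokens: torch.Tensor,
                                 agent_mask: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """Pull same-group tokens together and push agent tokens away from
    environment tokens in feature space.

    tokens:     (B, N, D) patch-token embeddings from a ViT.
    agent_mask: (B, N) bool, True where a patch is assumed to show the agent.
    """
    _, N, _ = tokens.shape
    z = F.normalize(tokens, dim=-1)                          # cosine-similarity space
    sim = torch.einsum("bnd,bmd->bnm", z, z) / temperature   # (B, N, N) pairwise logits

    # Positives: token pairs from the same group (agent-agent or env-env),
    # excluding each token paired with itself.
    same_group = agent_mask.unsqueeze(2) == agent_mask.unsqueeze(1)
    eye = torch.eye(N, dtype=torch.bool, device=tokens.device).unsqueeze(0)
    pos_mask = same_group & ~eye

    # InfoNCE-style objective: softmax over all other tokens, averaged over positives.
    logits = sim.masked_fill(eye, float("-inf"))             # never contrast with self
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)      # keep only positive pairs
    pos_count = pos_mask.sum(dim=-1).clamp(min=1)
    return (-(pos_log_prob.sum(dim=-1) / pos_count)).mean()


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 768)           # e.g. 14x14 patches from a ViT-B
    agent_mask = torch.rand(2, 196) > 0.7       # hypothetical agent/environment split
    aux_loss = inter_token_contrastive_loss(tokens, agent_mask)
    # total_loss = policy_loss + lambda_con * aux_loss   # auxiliary use (lambda_con assumed)
    print(aux_loss.item())
```

In end-to-end training, this auxiliary term would simply be added to the policy loss with a weighting coefficient (the `lambda_con` in the comment is a hypothetical name), matching the abstract's description of the contrastive loss as an auxiliary objective.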