🤖 AI Summary
In robotic manipulation, policies built on vision-based representations that lack embodied priors suffer from low sample efficiency and poor generalization. To address this, we propose ICon, the first method to perform agent–environment contrastive learning at the token level within Vision Transformers, explicitly injecting body-specific inductive biases to disentangle agent-centric and environment-centric features. ICon integrates an auxiliary contrastive loss into end-to-end policy learning and requires no additional annotations or explicit modeling of robot dynamics. Experiments demonstrate that ICon significantly improves multi-task manipulation performance, achieving higher sample efficiency in both simulation and real-world settings and enabling robust policy transfer across heterogeneous robot platforms. By bridging embodied cognition and visual representation learning, ICon establishes a novel paradigm for learning grounded, agent-aware visual representations.
📝 Abstract
Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present $\textbf{I}$nter-token $\textbf{Con}$trast ($\textbf{ICon}$), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. Project website: https://github.com/HenryWJL/icon
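
To make the idea concrete, below is a minimal sketch, not the authors' implementation, of a token-level agent–environment contrastive loss over ViT patch tokens. It assumes a per-patch boolean `agent_mask` is available as a stand-in for however ICon identifies agent tokens; the function name, the `temperature` value, and the mask source are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of a token-level
# agent-environment contrastive loss over ViT patch tokens.
import torch
import torch.nn.functional as F


def inter_token_contrastive_loss(tokens: torch.Tensor,
                                 agent_mask: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """Pull same-group tokens together and push agent tokens away from
    environment tokens in feature space.

    tokens:     (B, N, D) patch-token embeddings from a ViT.
    agent_mask: (B, N) bool, True where a patch is assumed to show the agent.
    """
    _, N, _ = tokens.shape
    z = F.normalize(tokens, dim=-1)                          # cosine-similarity space
    sim = torch.einsum("bnd,bmd->bnm", z, z) / temperature   # (B, N, N) pairwise logits

    # Positives: token pairs from the same group (agent-agent or env-env),
    # excluding each token paired with itself.
    same_group = agent_mask.unsqueeze(2) == agent_mask.unsqueeze(1)
    eye = torch.eye(N, dtype=torch.bool, device=tokens.device).unsqueeze(0)
    pos_mask = same_group & ~eye

    # InfoNCE-style objective: softmax over all other tokens, averaged over positives.
    logits = sim.masked_fill(eye, float("-inf"))             # never contrast with self
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)      # keep only positive pairs
    pos_count = pos_mask.sum(dim=-1).clamp(min=1)
    return (-(pos_log_prob.sum(dim=-1) / pos_count)).mean()


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 768)           # e.g. 14x14 patches from a ViT-B
    agent_mask = torch.rand(2, 196) > 0.7       # hypothetical agent/environment split
    aux_loss = inter_token_contrastive_loss(tokens, agent_mask)
    # total_loss = policy_loss + lambda_con * aux_loss   # auxiliary use (lambda_con assumed)
    print(aux_loss.item())
```

In end-to-end training, this auxiliary term would simply be added to the policy loss with a weighting coefficient (the `lambda_con` in the comment is a hypothetical name), matching the abstract's description of the contrastive loss as an auxiliary objective.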