LACE: Latent Visual Representation for Cross-Embodiment Learning

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a core obstacle in cross-embodiment learning: the visual representations of human demonstrations and robot executions do not match. To bridge this gap, the authors propose aligning human and robot visual representations within the latent space of a self-supervised vision backbone such as DINO. The approach requires only a single robot demonstration and uses forward kinematics to generate sparse correspondence labels automatically. Alignment is trained by jointly optimizing a semantic alignment loss, which lifts patch-level supervision to semantic-level alignment, and a Gram loss, which preserves the quality of the pretrained features. In the reported experiments, policies using the proposed LACE-DINO outperform those using the baseline DINO model by 65% in zero-shot transfer, with consistent gains in low-data regimes and out-of-distribution environments.
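The summary does not spell out how forward kinematics produces correspondence labels, so the following is only a minimal sketch of one plausible pipeline for the robot side: compute a link position from joint angles, project it into the image, and record which backbone patch it lands in. The planar arm model, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the 14-pixel patch size, and both function names are illustrative assumptions, not details taken from the paper.

```python
import math


def planar_fk(joint_angles, link_lengths):
    """Forward kinematics of a planar serial arm (hypothetical robot model).

    Returns the (x, y) position of the tip of every link, base at the origin.
    """
    x = y = theta = 0.0
    tips = []
    for angle, length in zip(joint_angles, link_lengths):
        theta += angle  # joint angles accumulate along the chain
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        tips.append((x, y))
    return tips


def project_to_patch(point_cam, fx, fy, cx, cy, patch_size):
    """Pinhole-project a 3D camera-frame point and return its (row, col) patch index."""
    X, Y, Z = point_cam
    u = fx * X / Z + cx  # horizontal pixel coordinate
    v = fy * Y / Z + cy  # vertical pixel coordinate
    return int(v // patch_size), int(u // patch_size)


# Assumed setup: the arm moves in a plane 1 m in front of the camera,
# and the camera frame coincides with the arm's base frame.
tips = planar_fk(joint_angles=(0.0, math.pi / 2), link_lengths=(0.3, 0.2))
gripper_x, gripper_y = tips[-1]
gripper_patch = project_to_patch(
    (gripper_x, gripper_y, 1.0), fx=200.0, fy=200.0, cx=112.0, cy=112.0, patch_size=14
)
```

A label of this kind only identifies where a robot body part appears in the robot image. Pairing it with the matching human body part would need a separate source of human keypoints, which this sketch does not cover.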

📝 Abstract

Cross-embodiment learning from human demonstrations is hindered by the visual gap between human and robot embodiments. While self-supervised learning (SSL) backbones encode rich inter-class semantics of general objects, we show they fail to establish correspondence between human and robot hands. We propose LACE, a framework that aligns human and robot visual representations in the latent space of these backbones by leveraging correspondences between shared body parts across embodiments as sparse supervision. These annotations can be automatically obtained via forward kinematics, and a single robot demonstration is sufficient to train the model. Our semantic alignment loss matches distributions incurred by corresponding features, lifting patch-level supervision to semantic-level alignment, while a Gram loss preserves pretrained feature quality. This alignment enables robot policies to leverage abundant human data when robot demonstrations are scarce: in zero-shot transfer, policies using LACE-DINO outperform those using DINO by a large margin (65%), with consistent gains in low-data regimes and out-of-distribution environments.
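The abstract names the two loss terms but gives no formulas, so the sketch below shows one possible reading rather than the authors' definitions. It assumes that the "distribution" of a patch feature is a softmax over its cosine similarities to a bank of reference features, that alignment is measured with a KL divergence between the human and robot distributions, and that the Gram loss is the mean squared difference between Gram matrices of L2-normalized patch features from the adapted and the frozen pretrained backbone. The function names and the temperature value are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def gram(features):
    """Gram matrix of pairwise cosine similarities between patch features."""
    unit = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def gram_loss(adapted, pretrained):
    """Mean squared difference between two Gram matrices (assumed form)."""
    ga, gp = gram(adapted), gram(pretrained)
    n = len(ga)
    return sum((ga[i][j] - gp[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


def similarity_distribution(query, bank, temperature=0.1):
    """Softmax over cosine similarities between one feature and a reference bank."""
    q = l2_normalize(query)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(k))) for k in bank]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(human_feat, robot_feat, bank, temperature=0.1):
    """KL divergence between the distributions of two corresponding features."""
    p = similarity_distribution(human_feat, bank, temperature)
    q = similarity_distribution(robot_feat, bank, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy 2-D features standing in for backbone patch embeddings.
bank = [[1.0, 0.0], [0.0, 1.0]]
misaligned = alignment_loss([1.0, 0.0], [0.0, 1.0], bank)
aligned = alignment_loss([1.0, 0.0], [1.0, 0.0], bank)
```

Under these assumptions, the alignment term is zero when a human patch and its corresponding robot patch relate to the reference bank in the same way, and the Gram term is zero when the adapted features keep the pairwise similarity structure of the pretrained ones.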

Problem

Research questions and friction points this paper is trying to address.

cross-embodiment learning

visual gap

human-robot correspondence

latent representation

visual alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-embodiment learning

latent alignment

self-supervised learning