Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses imitation learning across heterogeneous embodiments, where action-centric approaches depend on shared action spaces, heuristic retargeting, or large-scale multi-embodiment co-training, and therefore struggle with human or structurally different demonstrators. The authors propose Demo-JEPA, a cross-embodiment imitation framework that interprets visual demonstrations as implicit specifications of future goal states. By decoupling demonstration intent from embodiment-specific execution through a shared predictive representation space, the method translates a single source visual demonstration into target-compatible future latent trajectories, using only the demonstration and the target agent's own interaction experience and no action-level correspondences. The target agent treats these latent trajectories as subgoals and reaches them by planning under its own learned forward dynamics on top of a JEPA-based world model. On RLBench and real-world manipulation tasks, the approach matches specialized in-domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.

📝 Abstract

Robotic imitation learning is often treated as reproducing demonstrated actions, but actions are inherently embodiment-specific. When demonstrations come from humans or robots with different morphology, kinematics, or action spaces, this action-centric view requires shared action spaces, heuristic retargeting, or large-scale multi-embodiment co-training. We instead view demonstrations as implicit specifications of future goals: the target agent should infer what state the demonstrator is trying to realize, rather than how the demonstrator executes it. We propose Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. Built on a JEPA-based world model, Demo-JEPA translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space. The target agent then uses these latent trajectories as subgoals and realizes them through planning under its own learned forward dynamics. Because Demo-JEPA avoids action-level correspondence and requires only visual demonstrations plus the target agent's own interaction experience, it supports flexible imitation across heterogeneous embodiments. Experiments on RLBench and real-world manipulation tasks show that Demo-JEPA matches specialized in-domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.
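
The subgoal-following step described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the additive `forward_dynamics`, the hand-written `subgoals` list (standing in for the latent trajectory translated from a demonstration), the discrete action grid, and the exhaustive short-horizon search are all assumptions made for illustration, and every function name is hypothetical. It only shows the general pattern of reaching a sequence of latent subgoals by planning under a forward model, with no reference to the demonstrator's actions.

```python
import itertools

def forward_dynamics(z, a):
    """Toy stand-in for a learned latent forward model (assumed: z' = z + a)."""
    return [zi + ai for zi, ai in zip(z, a)]

def sq_dist(u, v):
    """Squared Euclidean distance between two latent vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

def plan_to_subgoal(z, goal, step=0.1, horizon=2):
    """Exhaustively search short action sequences over a discrete grid.

    Each sequence is rolled out with the forward model and scored by the
    summed squared distance to the subgoal along the rollout. Only the
    first action of the best sequence is returned (receding horizon).
    """
    actions = [list(a) for a in itertools.product((-step, 0.0, step), repeat=len(z))]
    best_cost, best_first = float("inf"), None
    for seq in itertools.product(actions, repeat=horizon):
        z_roll, cost = z, 0.0
        for a in seq:
            z_roll = forward_dynamics(z_roll, a)
            cost += sq_dist(z_roll, goal)
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first

def follow_latent_trajectory(z0, subgoals, tol=0.05, max_steps_per_goal=50):
    """Reach each latent subgoal in order by replanning at every step."""
    z = list(z0)
    for goal in subgoals:
        for _ in range(max_steps_per_goal):
            if sq_dist(z, goal) <= tol ** 2:
                break  # subgoal reached, move on to the next one
            z = forward_dynamics(z, plan_to_subgoal(z, goal))
    return z

if __name__ == "__main__":
    # Hypothetical 2-D latent subgoals; in the paper's setting these would
    # come from translating a source visual demonstration.
    subgoals = [[0.5, 0.0], [0.5, 0.5], [1.0, 0.5]]
    print(follow_latent_trajectory([0.0, 0.0], subgoals))
```

In this sketch the target agent never sees the demonstrator's actions: the only inputs are the subgoal sequence and the agent's own forward model, which is the property the abstract attributes to avoiding action-level correspondence.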

Problem

Research questions and friction points this paper is trying to address.

cross-embodiment imitation

one-shot imitation

embodiment heterogeneity

action-space mismatch

visual demonstration

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-embodiment imitation

joint-embedding predictive architecture

visual demonstration

latent trajectory planning

one-shot imitation
