Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation

📅 2026-05-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses imitation learning across heterogeneous embodiments. Action-centric approaches struggle when demonstrations come from humans or structurally different robots, because they depend on shared action spaces, heuristic retargeting, or large-scale multi-embodiment co-training. The authors propose Demo-JEPA, a cross-embodiment imitation framework that treats a visual demonstration as an implicit specification of future goal states. It decouples demonstration intent from embodiment-specific execution: a JEPA-based world model translates a single source demonstration into target-compatible latent trajectories in a shared predictive representation space, and the target agent reaches these latent subgoals by planning under its own learned forward dynamics. The method needs only visual demonstrations and the target agent’s own interaction experience, with no action-level correspondence. On RLBench and real-world manipulation tasks, it matches specialized in-domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.
📝 Abstract
Robotic imitation learning is often treated as reproducing demonstrated actions, but actions are inherently embodiment-specific. When demonstrations come from humans or robots with different morphology, kinematics, or action spaces, this action-centric view requires shared action spaces, heuristic retargeting, or large-scale multi-embodiment co-training. We instead view demonstrations as implicit specifications of future goals: the target agent should infer what state the demonstrator is trying to realize, rather than how the demonstrator executes it. We propose Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. Built on a JEPA-based world model, Demo-JEPA translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space. The target agent then uses these latent trajectories as subgoals and realizes them through planning under its own learned forward dynamics. Because Demo-JEPA avoids action-level correspondence and requires only visual demonstrations plus the target agent's own interaction experience, it supports flexible imitation across heterogeneous embodiments. Experiments on RLBench and real-world manipulation tasks show that Demo-JEPA matches specialized in-domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.
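The subgoal-following step described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: the latent space is a hand-made 2-D vector, `forward_dynamics` is a hypothetical stand-in for the target agent's learned latent dynamics, the subgoal list stands in for the translated demonstration trajectory, and the planner is a plain exhaustive search over a small discrete action set. All names and constants are assumptions made for illustration.

```python
import itertools
import math

# Hypothetical discrete action set: small displacements in a 2-D latent space.
STEPS = (-0.2, -0.1, 0.0, 0.1, 0.2)
ACTIONS = [(dx, dy) for dx in STEPS for dy in STEPS]


def forward_dynamics(z, a):
    """Toy stand-in for learned latent dynamics: z' = z + a."""
    return tuple(zi + ai for zi, ai in zip(z, a))


def distance(z, goal):
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((zi - gi) ** 2 for zi, gi in zip(z, goal)))


def plan_first_action(z, subgoal, horizon=2):
    """Search all action sequences of the given horizon, roll each out
    under the dynamics, and return the first action of the sequence with
    the lowest summed distance to the subgoal."""
    best_cost, best_action = float("inf"), None
    for seq in itertools.product(ACTIONS, repeat=horizon):
        z_roll, cost = z, 0.0
        for a in seq:
            z_roll = forward_dynamics(z_roll, a)
            cost += distance(z_roll, subgoal)
        if cost < best_cost:
            best_cost, best_action = cost, seq[0]
    return best_action


def follow_subgoals(z0, subgoals, tol=0.05, max_steps=50):
    """Receding-horizon loop: replan at every step, execute one action,
    and move on to the next subgoal once the current one is within tol."""
    z, reached = z0, 0
    for _ in range(max_steps):
        if reached == len(subgoals):
            break
        z = forward_dynamics(z, plan_first_action(z, subgoals[reached]))
        if distance(z, subgoals[reached]) < tol:
            reached += 1
    return z, reached


if __name__ == "__main__":
    latent_subgoals = [(0.4, 0.0), (0.4, 0.4), (0.0, 0.4)]
    final_z, n_reached = follow_subgoals((0.0, 0.0), latent_subgoals)
    print(final_z, n_reached)
```

The point of the sketch is that nothing in the loop refers to the demonstrator's actions: the agent only receives latent targets and picks its own actions through its own dynamics model, which is the decoupling the abstract describes. A real system would replace the exhaustive search with a sampling-based or gradient-based planner and the additive dynamics with a learned predictor.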
Problem

Research questions and friction points this paper is trying to address.

cross-embodiment imitation
one-shot imitation
embodiment heterogeneity
action-space mismatch
visual demonstration
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-embodiment imitation
joint-embedding predictive architecture
visual demonstration
latent trajectory planning
one-shot imitation
Authors
Jingyang He
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Guangrun Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Jieyu Zhang
University of Washington
Data-Centric AI, Agentic AI, Multimodal Models, Machine Learning, Computer Vision
Chengkai Hou
Peking University
Robot
Zhengping Che
X-Humanoid
Embodied AI, Deep Learning
Shanghang Zhang
Peking University
Embodied AI, Foundation Models