🤖 AI Summary
This work addresses imitation learning across heterogeneous embodiments, where action-centric approaches break down because they depend on aligned action spaces and cannot readily learn from humans or structurally different demonstrators. The authors propose Demo-JEPA, a cross-embodiment imitation framework that interprets visual demonstrations as implicit specifications of future goal states. By decoupling demonstration intent from execution policy through a shared predictive representation space, the method generates target-compatible latent trajectories from a single visual demonstration and the agent’s own interaction experience, without requiring action-level correspondences. Combining a JEPA-based world model with model-based planning under the agent’s learned forward dynamics, the approach matches specialized in-domain planners on RLBench and real-world manipulation tasks, and generalizes to unseen tasks and embodiment configurations where prior methods fail.
📝 Abstract
Robotic imitation learning is often treated as reproducing demonstrated actions, but actions are inherently embodiment-specific. When demonstrations come from humans or robots with different morphology, kinematics, or action spaces, this action-centric view requires shared action spaces, heuristic retargeting, or large-scale multi-embodiment co-training. We instead view demonstrations as implicit specifications of future goals: the target agent should infer what state the demonstrator is trying to realize, rather than how the demonstrator executes it. We propose Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. Built on a JEPA-based world model, Demo-JEPA translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space. The target agent then uses these latent trajectories as subgoals and realizes them through planning under its own learned forward dynamics. Because Demo-JEPA avoids action-level correspondence and requires only visual demonstrations plus the target agent's own interaction experience, it supports flexible imitation across heterogeneous embodiments. Experiments on RLBench and real-world manipulation tasks show that Demo-JEPA matches specialized in-domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.
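The planning step described above, in which the target agent treats latent trajectories as subgoals and reaches them under its own forward dynamics, can be illustrated with a toy sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the linear model `forward_dynamics` stands in for a learned latent forward model, the random-shooting planner in `plan_to_subgoal` stands in for whatever planner Demo-JEPA uses, and the subgoals are hand-made placeholders for the output of the demonstration-translation step, which is not modeled here.

```python
import numpy as np

def forward_dynamics(z, a, B):
    """Hypothetical stand-in for a learned latent forward model: z' = z + B a."""
    return z + a @ B.T

def plan_to_subgoal(z, goal, B, rng, horizon=3, n_samples=256):
    """Random-shooting planner (an assumed choice, not the paper's planner).

    Samples action sequences, rolls each one out in latent space with the
    forward model, and returns the sequence whose final latent state is
    closest to the subgoal.
    """
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, B.shape[1]))
    actions[0] = 0.0  # keep a "do nothing" candidate so a plan never moves away
    zs = np.repeat(z[None, :], n_samples, axis=0)
    for t in range(horizon):
        zs = forward_dynamics(zs, actions[:, t], B)
    costs = np.linalg.norm(zs - goal, axis=1)
    return actions[np.argmin(costs)]

def track_subgoals(z0, subgoals, B, seed=0):
    """Reach each latent subgoal in turn by planning and executing actions.

    The same toy model also plays the role of the environment here, which is
    a simplification: a real agent would act in the world and re-encode
    observations into the latent space.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    for goal in subgoals:
        for a in plan_to_subgoal(z, goal, B, rng):
            z = forward_dynamics(z, a, B)
    return z

# Placeholder subgoals: a straight line in a 2-D latent space. In the method
# described above, these would come from translating a source demonstration.
B = np.eye(2)
z0 = np.zeros(2)
goal = np.array([3.0, 3.0])
subgoals = np.linspace(z0, goal, num=5)[1:]
z_final = track_subgoals(z0, subgoals, B)
print("distance to final subgoal:", np.linalg.norm(z_final - goal))
```

The point of the sketch is that no source actions appear anywhere: the demonstration contributes only latent targets, and every action is chosen by the target agent from its own dynamics model.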