🤖 AI Summary
To address the challenge of cross-embodiment (human and heterogeneous robot) and cross-environment video transfer in few-shot robotic imitation learning, this paper proposes TraceGen, a world model operating in 3D trajectory space. The core innovation is a unified symbolic 3D trajectory–language representation that abstracts away visual appearance while preserving geometric structure, enabling efficient video-to-action mapping without object detection or pixel-level reconstruction. Through the accompanying TraceForge data pipeline, heterogeneous human and robot demonstration videos are automatically distilled into observation–trace–language triplets used to pretrain a transferable 3D motion prior. Experiments demonstrate that TraceGen achieves an 80% success rate across four manipulation tasks using only five demonstration videos from the target robot, and it still attains 67.5% success when adapted solely on five uncalibrated, smartphone-captured human demonstrations. Moreover, inference is 50–600× faster than state-of-the-art video-based world models.
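
To make the trace-space idea concrete, here is a minimal sketch of what one observation–trace–language triplet could look like as a data structure. All names, shapes, and values below are illustrative assumptions, not the paper's actual schema or API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TraceTriplet:
    """One observation-trace-language training example (hypothetical schema).

    A "trace" is a scene-level set of 3D point trajectories: for each of
    N tracked points, a sequence of T future 3D positions. This abstracts
    away pixel appearance while keeping the geometry needed for manipulation.
    """
    observation: np.ndarray   # e.g. an RGB frame, shape (H, W, 3)
    trace: np.ndarray         # future 3D trajectories, shape (N, T, 3)
    language: str             # task instruction, e.g. "pick up the mug"

# Hypothetical example with mock data, just to pin down the shapes.
example = TraceTriplet(
    observation=np.zeros((480, 640, 3), dtype=np.uint8),
    trace=np.random.randn(64, 16, 3).astype(np.float32),  # 64 points, 16 steps
    language="place the cup on the shelf",
)
assert example.trace.shape == (64, 16, 3)
```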
📝 Abstract
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments, such as humans and different robots, are abundant, differences in embodiment, camera, and environment hinder their direct use. We address this small-data problem by introducing a unifying symbolic representation, a compact 3D "trace-space" of scene-level trajectories, that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation–trace–language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50–600× faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
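
As a rough sketch of the pretrain-then-adapt workflow the abstract describes, the following hypothetical code shows a trace-space world model interface: pretraining on the large cross-embodiment corpus, adapting on a handful of target-robot videos, and predicting a trace at inference time. The class, methods, and shapes are placeholders for illustration, not the released implementation.

```python
import numpy as np

class TraceWorldModel:
    """Hypothetical stand-in for a trace-space world model like TraceGen.

    Rather than generating future video frames, it maps an (observation,
    instruction) pair to future 3D point trajectories, which is far
    cheaper than pixel-space generation and embodiment-agnostic.
    """

    def fit(self, triplets, epochs=1):
        # Placeholder training loop: a real model would take gradient
        # steps on a trace-prediction loss here.
        for _ in range(epochs):
            for triplet in triplets:
                pass

    def predict_trace(self, observation, instruction):
        # Placeholder output: N=64 tracked points, T=16 future steps, xyz.
        return np.zeros((64, 16, 3), dtype=np.float32)

# Pretrain on the large cross-embodiment corpus, then adapt with only a
# few target-robot demonstrations (counts from the paper; the datasets
# here are empty mocks).
model = TraceWorldModel()
pretrain_corpus = []  # ~1.8M observation-trace-language triplets in the paper
target_demos = []     # 5 videos from the target robot
model.fit(pretrain_corpus)
model.fit(target_demos)

trace = model.predict_trace(
    observation=np.zeros((480, 640, 3), dtype=np.uint8),
    instruction="open the drawer",
)
# A downstream controller would convert the predicted trace into actions.
```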