TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of cross-embodiment (human and heterogeneous-robot) and cross-environment video transfer in few-shot robotic imitation learning, this paper proposes TraceGen, a world model that operates in 3D trajectory space. The core innovation is a unified, symbolic 3D trajectory-language representation that abstracts away visual appearance while preserving geometric structure, enabling efficient video-to-action mapping without object detection or pixel-level reconstruction. The accompanying TraceForge data pipeline automatically distills heterogeneous demonstration videos into observation-trace-language triplets used to pretrain a transferable 3D motion prior. Experiments show that TraceGen achieves an 80% success rate across four manipulation tasks using only five demonstration videos from the target robot, reaches 67.5% success when trained solely on smartphone-captured human demonstrations, and runs 50-600x faster at inference than state-of-the-art video-based world models.
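As a rough illustration of the representation described above, the sketch below shows one possible shape for an observation-trace-language triplet and a trace-space prediction call. This is a minimal sketch, not the authors' code: the class, field, and function names are hypothetical.

```python
# Hypothetical sketch of an observation-trace-language triplet and a
# trace-space prediction interface; names are illustrative, not the paper's API.
from dataclasses import dataclass
import numpy as np

@dataclass
class TraceTriplet:
    observation: np.ndarray   # e.g. an RGB frame, shape (H, W, 3)
    trace: np.ndarray         # future 3D trajectory points, shape (T, N, 3)
    language: str             # task instruction, e.g. "pick up the red cup"

def predict_trace(model, observation: np.ndarray, language: str) -> np.ndarray:
    """Query a trace-space world model: given the current observation and a
    language instruction, return predicted future 3D motion of shape (T, N, 3)
    instead of future pixels. `model` stands in for a pretrained TraceGen-style
    network."""
    return model(observation, language)
```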

📝 Abstract
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - humans and different robots - are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation - a compact 3D "trace-space" of scene-level trajectories - that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation-trace-language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
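To make the trace-extraction idea concrete, here is a minimal sketch of one step a TraceForge-like pipeline could perform: lifting tracked 2D keypoints into camera-frame 3D traces using per-frame depth and camera intrinsics. The paper's actual pipeline details may differ; the function and variable names are illustrative only.

```python
# Hedged sketch: back-project 2D point tracks into 3D using depth and intrinsics.
import numpy as np

def unproject_tracks(tracks_2d: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """tracks_2d: (T, N, 2) pixel coordinates over T frames for N tracked points.
    depth: (T, H, W) depth maps in meters. K: (3, 3) camera intrinsics.
    Returns camera-frame 3D traces of shape (T, N, 3)."""
    T, N, _ = tracks_2d.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    traces = np.zeros((T, N, 3))
    for t in range(T):
        u = tracks_2d[t, :, 0]
        v = tracks_2d[t, :, 1]
        # Sample depth at each tracked pixel (row = v, column = u).
        z = depth[t, np.round(v).astype(int), np.round(u).astype(int)]
        traces[t, :, 0] = (u - cx) * z / fx   # back-project x
        traces[t, :, 1] = (v - cy) * z / fy   # back-project y
        traces[t, :, 2] = z
    return traces
```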
Problem

Research questions and friction points this paper is trying to address.

Learning robot tasks from few demonstrations across different embodiments
Abstracting 3D motion from videos to overcome appearance and environmental differences
Enabling efficient adaptation to new tasks with minimal target data
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D trace-space representation for cross-embodiment learning
World model predicting motion in trace-space, not pixel space
Data pipeline converting videos into 3D traces for pretraining (a hedged training-loop sketch follows this list)
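As a rough sketch of how pretraining on the distilled triplets, and later few-shot adaptation on a handful of target-robot videos, could look, the loop below regresses predicted traces against distilled ones. The optimizer, the MSE loss, and the model and data names are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a trace-prediction training loop; all names are assumptions.
import torch

def train_trace_prior(model: torch.nn.Module, triplets, epochs: int = 50):
    """triplets: iterable of (observation, instruction, target_trace) tensors
    distilled from demonstration videos (a large pretraining corpus, or just a
    few target-robot demos for adaptation). `instruction` is assumed to be an
    already-tokenized language input."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for observation, instruction, target_trace in triplets:
            predicted_trace = model(observation, instruction)  # (T, N, 3) future motion
            loss = torch.nn.functional.mse_loss(predicted_trace, target_trace)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```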
🔎 Similar Papers
No similar papers found.