3PoinTr: 3D Point Tracks for Robot Manipulation Pretraining from Casual Videos

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two challenges in robot policy learning: the need for large-scale demonstration data, and the embodiment mismatch that arises when learning from in-the-wild human videos. To overcome these limitations, the authors propose 3D point trajectories as an embodiment-agnostic intermediate representation and introduce a unified transformer model, based on the Perceiver IO architecture, that is jointly pretrained on trajectory prediction and behavior cloning. This approach enables effective pretraining directly on uncurated human videos for the first time, with the lightweight 3D trajectory representation preserving supervisory signals even under partial occlusion. The method achieves strong generalization in both simulated and real-world environments using only 20 action-labeled robot demonstrations, significantly outperforming existing behavior cloning and video-based pretraining baselines.
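As a rough illustration of the low-data fine-tuning stage, the sketch below fits a behavior-cloning head on 20 feature/action pairs with plain gradient descent on an MSE loss. The feature dimension, the 7-DoF action space, and the linear head are illustrative assumptions for this sketch, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 20 action-labeled demos, each summarized by a
# fixed-size feature from a pretrained encoder (dim 32), paired with
# a 7-DoF robot action (e.g. end-effector pose + gripper command).
feats = rng.normal(size=(20, 32))
actions = rng.normal(size=(20, 7))

# Linear behavior-cloning head trained by gradient descent on MSE.
W = np.zeros((32, 7))
lr = 0.1
for _ in range(2000):
    pred = feats @ W
    grad = feats.T @ (pred - actions) / len(feats)  # dMSE/dW (up to a constant)
    W -= lr * grad

mse = float(np.mean((feats @ W - actions) ** 2))
```

With 32-dimensional features and only 20 examples the linear head can fit the demonstrations almost exactly, which mirrors why a compact pretrained representation makes such tiny datasets workable.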

📝 Abstract
Data-efficient training of robust robot policies is the key to unlocking automation in a wide array of novel tasks. Current systems require large volumes of demonstrations to achieve robustness, which is impractical in many applications. Learning policies directly from human videos is a promising alternative that removes teleoperation costs, but it shifts the challenge toward overcoming the embodiment gap (differences in kinematics and strategies between robots and humans), often requiring restrictive and carefully choreographed human motions. We propose 3PoinTr, a method for pretraining robot policies from casual and unconstrained human videos, enabling learning from motions natural for humans. 3PoinTr uses a transformer architecture to predict 3D point tracks as an intermediate embodiment-agnostic representation. 3D point tracks encode goal specifications, scene geometry, and spatiotemporal relationships. We use a Perceiver IO architecture to extract a compact representation for sample-efficient behavior cloning, even when point tracks violate downstream embodiment-specific constraints. We conduct thorough evaluation on simulated and real-world tasks, and find that 3PoinTr achieves robust spatial generalization on diverse categories of manipulation tasks with only 20 action-labeled robot demonstrations. 3PoinTr outperforms the baselines, including behavior cloning methods, as well as prior methods for pretraining from human videos. We also provide evaluations of 3PoinTr's 3D point track predictions compared to an existing point track prediction baseline. We find that 3PoinTr produces more accurate and higher quality point tracks due to a lightweight yet expressive architecture built on a single transformer, in addition to a training formulation that preserves supervision of partially occluded points. Project page: https://adamhung60.github.io/3PoinTr/.
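The core pipeline the abstract describes, a single Perceiver IO-style transformer that compresses observation tokens into a small latent array and decodes 3D point tracks from output queries, can be sketched in a heavily simplified form. Every dimension, the single-head `attend` helper, and the linear output head below are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # single-head scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 32            # model width (illustrative)
n_inputs = 64     # encoded observation tokens (image features, etc.)
n_latents = 8     # compact latent array: the Perceiver IO bottleneck
n_queries = 16    # output queries, e.g. (tracked point, future timestep)

inputs = rng.normal(size=(n_inputs, d))    # stand-in encoder outputs
latents = rng.normal(size=(n_latents, d))  # learned latent array
queries = rng.normal(size=(n_queries, d))  # learned output queries

# 1) encode: latents cross-attend to the (arbitrarily long) input array
latents = attend(latents, inputs, inputs)
# 2) process: self-attention among the small latent array
latents = attend(latents, latents, latents)
# 3) decode: output queries cross-attend to the latents
decoded = attend(queries, latents, latents)

# hypothetical linear head mapping each decoded query to an (x, y, z) offset
W_out = rng.normal(size=(d, 3)) / np.sqrt(d)
tracks = decoded @ W_out  # shape (n_queries, 3)
```

The bottleneck structure is what makes the representation compact: the input length only affects step 1, so the same latent array can summarize variable-length video observations before being decoded into per-point, per-timestep 3D predictions.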
Problem

Research questions and friction points this paper is trying to address.

robot manipulation
embodiment gap
human videos
data-efficient learning
3D point tracks
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D point tracks
embodiment-agnostic representation
transformer-based pretraining
robot manipulation from human videos
sample-efficient behavior cloning