AI Summary
The robotics domain lacks efficient foundational model pretraining paradigms, primarily due to the high cost of labeled data and the absence of general-purpose representations capturing physical dynamics. To address this, we propose ARM4R, the first robotic foundation model that employs monocular depth-estimated temporal 3D point trajectories (a 4D representation: 3D spatial coordinates + time) as its core physical-world modeling primitive. ARM4R establishes a geometrically grounded linear mapping between these 4D trajectories and robot states, thereby bridging human-recorded video inputs to robot control via pretraining. Our method integrates unsupervised video pretraining, cross-modal representation alignment, and autoregressive sequence modeling. Extensive experiments across diverse robotic platforms and tasks demonstrate significant improvements in zero-shot and few-shot control performance. These results validate the effectiveness and strong cross-domain generalization capability of the 4D trajectory representation for robotic learning.
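The "linear mapping" between 3D point trajectories and robot states can be illustrated with a minimal sketch: a fixed camera-to-robot-base extrinsic transform (the matrices `R`, `tvec`, and `A` below are hypothetical placeholders, not values from the paper) carries any camera-frame 3D point into the robot frame as a single linear map in homogeneous coordinates.

```python
import numpy as np

# Hypothetical camera-to-robot-base extrinsics: rotation R, translation tvec.
# A camera-frame point p maps to the robot frame as R @ p + tvec, which is
# one linear map A in homogeneous coordinates -- the sense in which point
# trajectories and robot states share geometry up to a linear transformation.
R = np.array([[0., -1., 0.],
              [1.,  0., 0.],
              [0.,  0., 1.]])
tvec = np.array([0.1, 0.2, 0.0])

A = np.eye(4)
A[:3, :3] = R          # rotation block
A[:3, 3] = tvec        # translation column

p_cam = np.array([0.5, 0.3, 1.2, 1.0])  # camera-frame point, homogeneous
p_robot = A @ p_cam                      # same point in robot-base coordinates
```

Because the map is linear, a model pretrained to predict point trajectories in one frame can, in principle, be fine-tuned to predict robot states without relearning the underlying geometry.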
Abstract
Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities and thus highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited by either the need for costly robotic annotations or the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on 3D point tracking representations derived from videos by lifting 2D point tracks into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R transfers efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
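The lifting step described above — turning 2D point tracks into temporal 3D trajectories using per-frame monocular depth — can be sketched as a standard pinhole back-projection. The function name, array shapes, and intrinsics below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lift_tracks_to_4d(tracks_2d, depth_maps, K):
    """Lift 2D point tracks to 3D across time via per-frame depth.

    tracks_2d:  (T, N, 2) pixel coords of N tracked points over T frames
    depth_maps: (T, H, W) monocular depth estimates, one map per frame
    K:          (3, 3) camera intrinsics
    Returns     (T, N, 3) 3D trajectories; together with the time axis this
                forms the 4D representation (x, y, z, t).
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    T, N, _ = tracks_2d.shape
    points_3d = np.empty((T, N, 3))
    for t in range(T):
        u = tracks_2d[t, :, 0]
        v = tracks_2d[t, :, 1]
        # Sample depth at each tracked pixel, then back-project
        # with the pinhole camera model.
        z = depth_maps[t, v.astype(int), u.astype(int)]
        points_3d[t, :, 0] = (u - cx) * z / fx
        points_3d[t, :, 1] = (v - cy) * z / fy
        points_3d[t, :, 2] = z
    return points_3d
```

A point tracked at the principal point with unit depth back-projects to (0, 0, 1), which is a quick sanity check for the intrinsics handling.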