4D Visual Pre-training for Robot Learning

Date: 2025-08-24

AI Summary
Existing vision pretraining relies predominantly on 2D images, neglecting the intrinsic 3D structure of the physical world, while 3D pretraining is hindered by the scarcity of large-scale annotated 3D data. To address this, we propose FVP, the first 4D (3D spatial + temporal) visual pretraining framework tailored for real-world robot learning. FVP formulates pretraining as a spatiotemporal point cloud prediction task and uses diffusion models for self-supervised learning on large-scale RGB-D video sequences. Rather than targeting a single encoder, it is designed to improve a range of 3D representations. Experiments show that FVP boosts the average success rate of 3D Diffusion Policy by 28% across 12 real-world manipulation tasks and achieves state-of-the-art performance among imitation learning methods. FVP also generalizes across different encoders and datasets.
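
The summary mentions RGB-D video as the pretraining source and point clouds as the prediction target. As a rough illustration of how the two relate, the sketch below back-projects a depth image into a point cloud with the standard pinhole camera model and pairs each frame with its successor. The function names, the intrinsics parameters (`fx`, `fy`, `cx`, `cy`), and the pairing helper are assumptions made for illustration; the source does not describe FVP's actual data pipeline, and a real implementation would likely use an array library rather than plain lists.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_point_cloud(depth: Sequence[Sequence[float]],
                         fx: float, fy: float,
                         cx: float, cy: float) -> List[Point]:
    """Back-project a depth image (rows of depths in metres) to 3D points.

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):        # v: pixel row index
        for u, z in enumerate(row):        # u: pixel column index
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def consecutive_pairs(clouds: Sequence) -> list:
    """Pair each frame's cloud with the following frame's cloud.

    Hypothetical helper: each (current, next) pair is one candidate
    training example for a next-point-cloud-prediction objective.
    """
    return list(zip(clouds[:-1], clouds[1:]))
```

For a video of `T` frames this yields `T - 1` (current, next) pairs without requiring any manual labels, which is what makes the objective self-supervised.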

Abstract
General visual representations learned from web-scale datasets have achieved great success in robotics in recent years, enabling data-efficient robot learning on manipulation tasks. However, these pre-trained representations are mostly learned from 2D images, neglecting the inherent 3D nature of the world. Because large-scale 3D data is scarce, it remains hard to extract a universal 3D representation from web datasets. Instead, we seek a general visual pre-training framework that can improve all 3D representations. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the predictor as a diffusion model, and pre-trains the model directly on larger public datasets. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance among imitation learning methods. Moreover, FVP remains effective across various point cloud encoders and datasets. Finally, we apply FVP to RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks. Our project page is available at: https://4d-visual-pretraining.github.io/.
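
The abstract frames pretraining as next-point-cloud prediction with a diffusion model but gives no further detail. The sketch below shows what one training step could look like under a standard DDPM-style noise-prediction objective: the next frame's point cloud is noised, and a denoiser conditioned on the current frame is scored on how well it recovers the noise. The linear beta schedule, the `denoiser(noisy_next, t, current_cloud)` signature, and all function names are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(next_cloud, alpha_bar, rng):
    """Forward process: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    noise = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in next_cloud]
    scale_x, scale_n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [tuple(scale_x * p + scale_n * n for p, n in zip(pt, ns))
             for pt, ns in zip(next_cloud, noise)]
    return noisy, noise


def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error over every coordinate of every point."""
    errors = [(p - n) ** 2
              for pp, nn in zip(predicted_noise, true_noise)
              for p, n in zip(pp, nn)]
    return sum(errors) / len(errors)


def training_step(current_cloud, next_cloud, denoiser, alpha_bars, rng):
    """One step: noise the next cloud, predict the noise given the current one."""
    t = rng.randrange(len(alpha_bars))            # random diffusion timestep
    noisy_next, noise = add_noise(next_cloud, alpha_bars[t], rng)
    predicted = denoiser(noisy_next, t, current_cloud)  # hypothetical model call
    return noise_prediction_loss(predicted, noise)
```

In a real system the denoiser would be a neural network whose conditioning on `current_cloud` passes through the point cloud encoder being pre-trained, so that minimizing this loss shapes the encoder's representation.
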
Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of large-scale 3D data for robot learning
Developing universal 3D representation from web datasets for robotics
Improving 3D representations through a 4D visual pre-training framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D visual pre-training framework
next-point-cloud-prediction diffusion model
pre-trained on large public datasets