DeFM: Learning Foundation Representations from Depth for Robotics

2026-01-26
Citations: 0
Influential: 0

AI Summary
This work addresses the gap between the depth and RGB modalities in representation learning for robotics: RGB has large-scale foundation models, while depth lacks generalizable, self-supervised ones. To bridge this gap, we propose DeFM, a self-supervised foundation model trained exclusively on depth images. Using DINO-style self-distillation over a dataset of 60 million depth images, DeFM learns geometric and semantic representations. A novel input normalization strategy preserves metric awareness across multiple scales. DeFM can be distilled into lightweight models and applied off-the-shelf to depth-based classification, segmentation, navigation, locomotion, and manipulation. It achieves state-of-the-art performance on these benchmarks and generalizes from simulation to real-world environments. The pretrained models are released to support further research.

๐Ÿ“ Abstract
Depth sensors are widely deployed across robotic platforms, and advances in fast, high-fidelity depth simulation have enabled robotic policies trained on depth observations to achieve robust sim-to-real transfer for a wide range of tasks. Despite this, representation learning for the depth modality remains underexplored compared to RGB, where large-scale foundation models now define the state of the art. To address this gap, we present DeFM, a self-supervised foundation model trained entirely on depth images for robotic applications. Using a DINO-style self-distillation objective on a curated dataset of 60M depth images, DeFM learns geometric and semantic representations that generalize to diverse environments, tasks, and sensors. To retain metric awareness across multiple scales, we introduce a novel input normalization strategy. We further distill DeFM into compact models suitable for resource-constrained robotic systems. When evaluated on depth-based classification, segmentation, navigation, locomotion, and manipulation benchmarks, DeFM achieves state-of-the-art performance and demonstrates strong generalization from simulation to real-world environments. We release all our pretrained models, which can be adopted off-the-shelf for depth-based robotic learning without task-specific fine-tuning. Webpage: https://de-fm.github.io/
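
As a rough illustration of what a "DINO-style self-distillation objective" generally refers to, the sketch below computes the cross-entropy between a centered, sharpened teacher distribution and a student distribution, plus the exponential-moving-average (EMA) teacher update. It is a generic sketch of the DINO recipe, not DeFM's implementation: the function names, temperatures (0.1 and 0.04), and momentum (0.996) are illustrative assumptions, and the paper's input normalization is not reproduced here because the abstract does not describe it.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats (numerically stable)."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax over a list of floats."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in scaled))
    return [v - log_total for v in scaled]


def self_distillation_loss(student_logits, teacher_logits, center,
                           student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student) for one pair of augmented views.

    The teacher output is centered and sharpened with a low temperature;
    in training, no gradient would flow through the teacher.
    All default values here are assumptions for illustration.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Move teacher weights toward the student with an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]


# Toy usage: a student that agrees with the teacher gets a lower loss.
center = [0.0, 0.0, 0.0]
teacher = [2.0, 0.0, -1.0]
agree = self_distillation_loss([2.0, 0.0, -1.0], teacher, center)
disagree = self_distillation_loss([-1.0, 0.0, 2.0], teacher, center)
```

In a full training loop the two logit vectors would come from student and teacher networks applied to different augmented views of the same depth image, and the center would itself be a running mean of teacher outputs; both are omitted here to keep the sketch short.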
Problem

Research questions and friction points this paper is trying to address.

depth representation learning
foundation model
robotic perception
sim-to-real transfer
self-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation model
depth sensing
self-supervised learning
sim-to-real transfer
metric-aware representation