DeFM: Learning Foundation Representations from Depth for Robotics

📅 2026-01-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the gap between depth and RGB in foundation-scale representation learning for robotics: RGB has large self-supervised foundation models, while depth has lacked a generalizable counterpart. To bridge this gap, the authors propose DeFM, a self-supervised foundation model trained exclusively on depth images. Using DINO-style self-distillation on a curated dataset of 60 million depth images, DeFM learns geometric and semantic representations. A novel input normalization strategy preserves metric awareness across multiple scales. DeFM is also distilled into compact models for resource-constrained robots and can be used off-the-shelf, without task-specific fine-tuning, for depth-based classification, segmentation, navigation, locomotion, and manipulation. It achieves state-of-the-art performance on these benchmarks and generalizes well from simulation to real-world environments. The pretrained models are released to support further research.

📝 Abstract

Depth sensors are widely deployed across robotic platforms, and advances in fast, high-fidelity depth simulation have enabled robotic policies trained on depth observations to achieve robust sim-to-real transfer for a wide range of tasks. Despite this, representation learning for the depth modality remains underexplored compared to RGB, where large-scale foundation models now define the state of the art. To address this gap, we present DeFM, a self-supervised foundation model trained entirely on depth images for robotic applications. Using a DINO-style self-distillation objective on a curated dataset of 60M depth images, DeFM learns geometric and semantic representations that generalize to diverse environments, tasks, and sensors. To retain metric awareness across multiple scales, we introduce a novel input normalization strategy. We further distill DeFM into compact models suitable for resource-constrained robotic systems. When evaluated on depth-based classification, segmentation, navigation, locomotion, and manipulation benchmarks, DeFM achieves state-of-the-art performance and demonstrates strong generalization from simulation to real-world environments. We release all our pretrained models, which can be adopted off-the-shelf for depth-based robotic learning without task-specific fine-tuning. Webpage: https://de-fm.github.io/
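
For readers unfamiliar with the "DINO-style self-distillation objective" mentioned in the abstract, the following is a minimal, generic sketch of that family of losses for a single sample. It is not DeFM's implementation: the function names, the temperatures (0.1 and 0.04), and the EMA momentum (0.996) are illustrative assumptions, and the paper's depth-specific input normalization is not shown because the abstract does not specify it.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats (numerically stable)."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher distribution and
    the student distribution. Temperatures are assumed values, not DeFM's."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)   # low temperature sharpens
    student_probs = softmax(student_logits, student_temp)
    # Small epsilon guards against log(0) when the student is very confident.
    return -sum(p * math.log(q + 1e-12)
                for p, q in zip(teacher_probs, student_probs))

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow the student as an exponential moving average;
    no gradient flows through the teacher. Momentum is an assumed value."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Two augmented views of the same image would feed the student and the teacher.
center = [0.0, 0.0, 0.0]
agree = dino_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], center)
disagree = dino_loss([0.0, 0.0, 2.0], [2.0, 0.0, 0.0], center)
```

In this sketch the loss is small when the student's prediction for one view agrees with the teacher's prediction for the other view, and large when they disagree. In a full training loop the `center` vector would also be updated as a running mean of teacher outputs to discourage collapse.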

Problem

Research questions and friction points this paper is trying to address.

depth representation learning

foundation model

robotic perception

sim-to-real transfer

self-supervised learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation model

depth sensing

self-supervised learning