MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses metric depth estimation from large-scale heterogeneous 3D data, a task often hindered by sensor noise, camera-dependent biases, and metric ambiguity. The authors propose a scalable pretraining framework that uses a universal sparse metric prompting mechanism to learn metric depth directly from diverse, noisy 3D sources, without handcrafted prompts or task-specific architectures. Their approach reveals, for the first time, clear scaling laws in metric depth estimation and effectively disentangles spatial reasoning from sensor-induced biases. Pretrained on approximately 20 million image-depth pairs spanning reconstructed, captured, and rendered data, and built on a Vision Transformer (ViT) with a prompt-driven learning paradigm, the model achieves state-of-the-art performance on monocular depth estimation, camera intrinsics recovery, 3D reconstruction, and vision-language-action (VLA) planning, substantially enhancing spatial intelligence in multimodal foundation models.
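
The summary reports a clear scaling law but gives no functional form or measurements. As a rough, hypothetical illustration of what such a trend means, the sketch below fits the standard saturating power law L(N) = a·N^(-b) + c to made-up (dataset size, loss) pairs; the data points, the `power_law` helper, and the initial guesses are all assumptions for illustration, not numbers from the paper.

```python
# Illustrative only: fits a saturating power law, the usual form for
# neural scaling curves, to hypothetical (dataset size, loss) points.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Loss decays as a power of dataset size and saturates at c.
    return a * np.power(n, -b) + c

# Hypothetical measurements: training pairs (in millions) vs. depth loss.
n_millions = np.array([1.0, 2.0, 5.0, 10.0, 20.0])
loss = np.array([0.210, 0.185, 0.158, 0.141, 0.129])

params, _ = curve_fit(power_law, n_millions, loss, p0=(0.12, 0.5, 0.09))
a, b, c = params
print(f"fit: loss(N) ≈ {a:.3g} * N^(-{b:.3g}) + {c:.3g}")
```

A roughly linear trend of log(loss − c) against log N is what "clear scaling law" typically refers to in this literature.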

📝 Abstract
Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution, and radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts the spatial-intelligence capabilities of Multimodal Large Language Models. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
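
The abstract states only that the Sparse Metric Prompt is created by randomly masking depth maps; the sketch below shows one minimal way that could look in PyTorch. The `keep_ratio`, the zero-as-invalid convention, and the `sparse_metric_prompt` helper are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch of the Sparse Metric Prompt idea: keep a random sparse
# subset of metric depth values so only those pixels carry absolute scale.
import torch

def sparse_metric_prompt(depth: torch.Tensor, keep_ratio: float = 0.01):
    """depth: (H, W) metric depth in meters; 0 marks invalid pixels.

    Returns (prompt, keep): prompt retains roughly `keep_ratio` of the
    valid depth values and zeros out the rest; keep flags kept pixels.
    """
    valid = depth > 0
    # Sample a random keep mask, then restrict it to valid pixels so the
    # prompt never exposes missing or sentinel values to the model.
    keep = (torch.rand_like(depth) < keep_ratio) & valid
    prompt = torch.where(keep, depth, torch.zeros_like(depth))
    return prompt, keep

# Example: a fake 480x640 depth map with ~1% of pixels kept as prompt.
depth = torch.rand(480, 640) * 10.0   # depths in [0, 10) meters
prompt, keep = sparse_metric_prompt(depth)
print(keep.float().mean().item())     # ~0.01
```

Because the prompt is just a masked depth map, the same interface can be fed sparse points from LiDAR, radar, SfM reconstructions, or rendered data, which is presumably what makes it camera- and sensor-agnostic.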
Problem

Research questions and friction points this paper is trying to address.

metric depth estimation
heterogeneous sensor noise
camera-dependent biases
metric ambiguity
scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Metric Depth Estimation
Scaling Laws
Sparse Metric Prompt
Heterogeneous 3D Data
Vision Foundation Model