🤖 AI Summary
This paper addresses generalizable multi-view geometric reconstruction: recovering spatially consistent 3D geometry from an arbitrary number of visual inputs, with either known or unknown camera poses. The proposed model, Depth Anything 3 (DA3), uses only a plain Transformer backbone (e.g., a vanilla DINO encoder) and a single depth-ray prediction target, avoiding both architectural specialization and complex multi-task learning. A teacher-student training paradigm brings its detail and generalization on par with Depth Anything 2 (DA2). On a newly established visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering, DA3 sets a new state of the art across all tasks, surpassing the prior SOTA, VGGT, by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy; it also outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
📝 Abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing the prior SOTA, VGGT, by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
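To make the "depth-ray prediction target" concrete, here is a minimal geometric sketch of one plausible reading: the network predicts, per pixel, a ray (origin and unit direction in a shared frame) plus a depth along that ray, and 3D geometry is recovered by lifting each pixel along its ray. All names here (`unproject_depth_ray`, the toy shapes) are illustrative assumptions, not DA3's actual API or parameterization.

```python
import numpy as np

def unproject_depth_ray(origins, directions, depth):
    """Lift per-pixel depth along predicted rays into a 3D point map.

    origins:    (H, W, 3) ray origins (e.g., camera centers, broadcast per pixel)
    directions: (H, W, 3) unit ray directions in a shared world frame
    depth:      (H, W)    predicted depth along each ray
    returns:    (H, W, 3) point map in the shared frame
    """
    # Point = origin + depth * direction, broadcast over the last axis.
    return origins + depth[..., None] * directions

# Toy example: a 2x2 "image" whose rays all start at the origin and
# point along +z, with a constant predicted depth of 5.
H, W = 2, 2
origins = np.zeros((H, W, 3))
directions = np.zeros((H, W, 3))
directions[..., 2] = 1.0          # unit rays along +z
depth = np.full((H, W), 5.0)

points = unproject_depth_ray(origins, directions, depth)
print(points[0, 0])               # -> [0. 0. 5.]
```

Under this reading, a single head suffices because the ray field encodes camera pose and the depth encodes geometry, so both are supervised through one target.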