🤖 AI Summary
This paper addresses generalizable multi-view geometric reconstruction: recovering spatially consistent 3D geometry from an arbitrary number of visual inputs, with either known or unknown camera poses. The proposed model, Depth Anything 3 (DA3), uses only a plain Transformer backbone (e.g., a vanilla DINO encoder) and a single depth-ray prediction target, avoiding both architectural specialization and complex multi-task learning. A teacher-student training paradigm brings its detail and generalization on par with Depth Anything 2 (DA2). On a newly established visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering, DA3 sets a new state of the art across all tasks, surpassing the prior SOTA, VGGT, by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy; it also outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
📝 Abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing the prior SOTA, VGGT, by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
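To make the "depth-ray prediction target" concrete, here is a minimal geometric sketch of one plausible reading: the network predicts, per pixel, a ray (origin and unit direction in a shared frame) plus a depth along that ray, and 3D geometry is recovered by lifting each pixel along its ray. All names here (`unproject_depth_ray`, the toy shapes) are illustrative assumptions, not DA3's actual API or parameterization.

```python
import numpy as np

def unproject_depth_ray(origins, directions, depth):
    """Lift per-pixel depth along predicted rays into a 3D point map.

    origins:    (H, W, 3) ray origins (e.g., camera centers, broadcast per pixel)
    directions: (H, W, 3) unit ray directions in a shared world frame
    depth:      (H, W)    predicted depth along each ray
    returns:    (H, W, 3) point map in the shared frame
    """
    # Point = origin + depth * direction, broadcast over the last axis.
    return origins + depth[..., None] * directions

# Toy example: a 2x2 "image" whose rays all start at the origin and
# point along +z, with a constant predicted depth of 5.
H, W = 2, 2
origins = np.zeros((H, W, 3))
directions = np.zeros((H, W, 3))
directions[..., 2] = 1.0          # unit rays along +z
depth = np.full((H, W), 5.0)

points = unproject_depth_ray(origins, directions, depth)
print(points[0, 0])               # -> [0. 0. 5.]
```

Under this reading, a single head suffices because the ray field encodes camera pose and the depth encodes geometry, so both are supervised through one target.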