Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key limitation in existing monocular depth estimation methods, which typically assume that geometric information is uniformly distributed across all layers of vision foundation models, thereby overlooking the non-uniform presence of 3D structural cues. Through a systematic analysis of DINOv3, the study reveals that deeper features exhibit stronger geometric expressiveness. Building on this insight, the authors propose an adaptive feature recombination mechanism that treats the final layer as a geometric anchor: complementary intermediate layers are selected based on a minimal similarity criterion and fused via lightweight linear adapters. This approach departs from conventional uniform sampling paradigms and achieves significant improvements in both accuracy and generalization across multiple benchmarks, validating the effectiveness of the proposed strategy.
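One common way to check a claim like "deeper features exhibit stronger geometric expressiveness" is a per-layer linear probe: fit a linear map from each layer's features to depth and compare how well each fits. The sketch below is illustrative only and is not necessarily the analysis used in the paper. The function name `probe_r2`, the ridge penalty, and the use of in-sample R² as the score are all assumptions made for this example.

```python
import numpy as np

def probe_r2(feats, depth, ridge=1e-3):
    """Fit a ridge-regularised linear probe from features to depth.

    feats: (n_samples, dim) features from one layer (hypothetical input).
    depth: (n_samples,) depth targets.
    Returns the in-sample R^2 of the fitted probe.
    """
    # Append a constant column so the probe can learn a bias term.
    x = np.hstack([feats, np.ones((feats.shape[0], 1))])
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    w = np.linalg.solve(gram, x.T @ depth)
    pred = x @ w
    ss_res = np.sum((depth - pred) ** 2)
    ss_tot = np.sum((depth - depth.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy data: one "layer" linearly encodes depth, the other is unrelated noise.
rng = np.random.default_rng(0)
informative = rng.normal(size=(200, 4))
uninformative = rng.normal(size=(200, 4))
depth = informative @ np.array([1.0, -2.0, 0.5, 3.0])

r2_informative = probe_r2(informative, depth)
r2_uninformative = probe_r2(uninformative, depth)
```

Running the probe on the features of every transformer layer and plotting the scores against layer index would show where depth is most linearly recoverable. A held-out split would give a fairer score than the in-sample R² used here.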

📝 Abstract

Monocular depth estimation (MDE) is a fundamental yet inherently ill-posed task. Recent vision foundation models (VFMs), particularly DINO-based transformers, have significantly improved accuracy and generalization for dense prediction. Prior works generally follow a unified paradigm: sampling a fixed set of intermediate transformer layers at uniform intervals to build multi-scale features. This common practice implicitly assumes that geometric information is uniformly distributed across layers, which may underutilize the structural 3D cues encoded in VFMs. In this study, we present a systematic layer-wise analysis of DINOv3, revealing that 3D information is distributed non-uniformly: deeper layers exhibit stronger depth predictability and better capture inter-sample geometric variation. Motivated by this, we introduce a Last-Layer-Centric Feature Recombination (LFR) module to enhance geometric expressiveness. LFR treats the final layer as a geometric anchor and adaptively selects complementary intermediate layers according to a minimal-similarity criterion. Selected features are fused with the last-layer representation via compact linear adapters. Extensive experiments show that the LFR module consistently improves MDE accuracy and achieves state-of-the-art performance. Our analysis sheds light on how geometric knowledge is organized within VFMs and offers an efficient strategy for unlocking their potential in dense 3D tasks.
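The selection-and-fusion idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure, the number of selected layers, or how the adapters are combined, so the mean per-token cosine similarity, the parameter `k`, the additive fusion, and every function name here are hypothetical choices.

```python
import numpy as np

def cosine_similarity(a, b):
    """Mean per-token cosine similarity between two (tokens, dim) feature maps."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean(np.sum(a_n * b_n, axis=-1)))

def select_complementary_layers(layer_feats, k):
    """Return indices of the k intermediate layers least similar to the last layer."""
    anchor = layer_feats[-1]  # the final layer acts as the geometric anchor
    sims = [cosine_similarity(f, anchor) for f in layer_feats[:-1]]
    order = np.argsort(sims)  # ascending: least similar first
    return sorted(int(i) for i in order[:k])

def fuse_with_anchor(layer_feats, selected, adapters):
    """Add linearly adapted selected layers to the last-layer features.

    adapters: one (dim, dim) matrix per selected layer (assumed additive fusion).
    """
    fused = layer_feats[-1].copy()
    for idx, weight in zip(selected, adapters):
        fused = fused + layer_feats[idx] @ weight
    return fused

# Toy example: 4 layers, 5 tokens, 3 channels.
anchor = np.tile(np.array([1.0, 0.0, 0.0]), (5, 1))
layers = [
    -anchor,                                     # similarity -1 to the anchor
    anchor.copy(),                               # similarity +1 (redundant)
    np.tile(np.array([0.0, 1.0, 0.0]), (5, 1)),  # similarity 0 (orthogonal)
    anchor,                                      # last layer
]
selected = select_complementary_layers(layers, k=2)
fused = fuse_with_anchor(layers, selected, [np.eye(3), np.eye(3)])
```

In this toy setup the redundant layer is skipped and the two least similar layers are chosen. In practice the adapters would be learned jointly with the depth decoder rather than fixed to the identity.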

Problem

Research questions and friction points this paper is trying to address.

Monocular Depth Estimation

Vision Foundation Models

3D Geometric Knowledge

Feature Representation

Layer-wise Analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Last-Layer-Centric

Feature Recombination

DINOv3