Focusable Monocular Depth Estimation

📅 2026-05-12

🤖 AI Summary
Existing monocular depth estimation methods optimize all pixels uniformly, so they cannot prioritize accuracy in a user-specified region of interest. This work proposes the Focusable Monocular Depth Estimation (FDE) task, in which a bounding-box or text prompt directs the model to prioritize depth accuracy and boundary sharpness in the target region while keeping global scene geometry coherent. The proposed framework, FocusDepth, conditions depth estimation on the prompt through a Multi-Scale Spatial-Aligned Fusion (MSSA) module, which aligns multi-scale features from Segment Anything Model 3 (SAM 3) to the Depth Anything family and injects them through gated fusion. This allows dense prompt injection that balances localized focus with preservation of global structure. On the newly curated FDE-Bench benchmark, the method consistently improves over globally fine-tuned DA2/DA3 baselines, with the largest gains in foreground regions and at target boundaries. Ablations indicate that spatial alignment is the key design factor: disrupting prompt-geometry correspondence increases AbsRel by up to 13.8%.
📝 Abstract
Monocular depth foundation models generalize well across scenes, yet they are typically optimized with uniform pixel-wise objectives that do not distinguish user-specified or task-relevant target regions from the surrounding context. We therefore introduce Focusable Monocular Depth Estimation (FDE), a region-aware depth estimation task in which, given a specified target region, the model is required to prioritize foreground depth accuracy, preserve sharp boundary transitions, and maintain coherent global scene geometry. To prioritize task-critical region modeling, we propose FocusDepth, a prompt-conditioned monocular relative depth estimation framework that guides depth modeling to focus on target regions via box/text prompts. The core Multi-Scale Spatial-Aligned Fusion (MSSA) in FocusDepth spatially aligns multi-scale features from Segment Anything Model 3 to the Depth Anything family and injects them through scale-specific, gated conditional fusion. This enables dense prompt cue injection without disrupting geometric representations, thereby endowing the depth estimation model with focused perception capability. To study FDE, we establish FDE-Bench, a target-centric monocular relative depth benchmark built from image-target-depth triplets across five datasets, containing 252.9K/72.5K train/val triplets and 972 categories spanning real-world and embodied simulation environments. On FDE-Bench, FocusDepth consistently improves over globally fine-tuned DA2/DA3 baselines under both box and text prompts, with the largest gains appearing in target boundary and foreground regions while preserving global scene geometry. Ablations show that MSSA's spatial alignment is the key design factor, as disrupting prompt-geometry correspondence increases AbsRel by up to 13.8%.
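The abstract describes MSSA as spatially aligning prompt features to the depth features at each scale and injecting them through gated conditional fusion. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: the function names, the nearest-neighbour resize, the 1x1 projection, the sigmoid gate, and the residual form `depth + gate * projected` are all assumptions made for illustration, and the weights are random rather than learned.

```python
import numpy as np

def resize_nearest(feat, out_hw):
    """Nearest-neighbour resize of a (C, H, W) map to (C, out_h, out_w)."""
    _, h, w = feat.shape
    out_h, out_w = out_hw
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source col for each output col
    return feat[:, rows[:, None], cols[None, :]]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(depth_feat, prompt_feat, w_proj, w_gate, b_gate):
    """Fuse prompt features into depth features at one scale (hypothetical form).

    depth_feat:  (Cd, H, W) features from the depth backbone
    prompt_feat: (Cp, h, w) prompt-conditioned features
    w_proj:      (Cd, Cp)   1x1 projection into the depth channel space
    w_gate:      (Cd, 2*Cd) 1x1 gate weights over [depth, projected]
    b_gate:      (Cd,)      gate bias
    """
    _, h, w = depth_feat.shape
    # Spatial alignment: bring the prompt map onto the depth feature grid.
    aligned = resize_nearest(prompt_feat, (h, w))
    projected = np.einsum("dc,chw->dhw", w_proj, aligned)
    # Per-pixel, per-channel gate in (0, 1) computed from both streams.
    joint = np.concatenate([depth_feat, projected], axis=0)
    gate = sigmoid(np.einsum("dk,khw->dhw", w_gate, joint)
                   + b_gate[:, None, None])
    # Residual injection: a closed gate leaves the depth features untouched.
    return depth_feat + gate * projected

# Scale-specific fusion: each scale gets its own (here random) parameters.
rng = np.random.default_rng(0)
depth_shapes = [(8, 16, 16), (16, 8, 8)]      # hypothetical (Cd, H, W)
prompt_shapes = [(4, 8, 8), (4, 4, 4)]        # hypothetical (Cp, h, w)
fused = []
for (cd, h, w), p_shape in zip(depth_shapes, prompt_shapes):
    depth = rng.normal(size=(cd, h, w))
    prompt = rng.normal(size=p_shape)
    fused.append(gated_fuse(
        depth, prompt,
        w_proj=rng.normal(size=(cd, p_shape[0])) * 0.1,
        w_gate=rng.normal(size=(cd, 2 * cd)) * 0.1,
        b_gate=np.zeros(cd),
    ))
print([f.shape for f in fused])
```

In this sketch the residual form means the depth features pass through unchanged wherever the gate is near zero or the projected prompt signal vanishes, which is one plausible way to inject dense prompt cues without disrupting the geometric representation.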
Problem

Research questions and friction points this paper is trying to address.

Monocular Depth Estimation
Region-aware
Focusable Depth
Target Region
Depth Accuracy
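To make the "target region" and "depth accuracy" notions above concrete, here is a hedged sketch of how AbsRel could be reported separately for the whole image, the box-prompted foreground, and the background. It is not the FDE-Bench protocol, which the text here does not specify. Assumptions: relative depth is first aligned to ground truth with a least-squares scale and shift over valid pixels (a common convention for relative depth), alignment is done directly in the space of the arrays passed in, the box is `(x0, y0, x1, y1)` in pixels with exclusive upper bounds, and all helper names are hypothetical.

```python
import numpy as np

def align_scale_shift(pred, gt, mask):
    """Return s * pred + t, with (s, t) fitted by least squares on mask."""
    x, y = pred[mask], gt[mask]
    design = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(design, y, rcond=None)
    return s * pred + t

def abs_rel(pred, gt, mask):
    """Mean absolute relative error, mean(|pred - gt| / gt), over mask."""
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def box_mask(shape, box):
    """Boolean mask that is True inside the (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    mask = np.zeros(shape, dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def focus_report(pred, gt, box):
    """AbsRel over all valid pixels, the target box, and its complement."""
    valid = gt > 0                              # ignore missing ground truth
    aligned = align_scale_shift(pred, gt, valid)
    fg = box_mask(gt.shape, box) & valid        # must be non-empty
    return {
        "global": abs_rel(aligned, gt, valid),
        "foreground": abs_rel(aligned, gt, fg),
        "background": abs_rel(aligned, gt, valid & ~fg),
    }

# Toy usage: a prediction with extra noise inside the target box.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(32, 32))
pred = 0.5 * gt + 2.0
pred[8:24, 8:24] += rng.normal(scale=0.5, size=(16, 16))
print(focus_report(pred, gt, box=(8, 8, 24, 24)))
```

Reporting the three numbers side by side is what lets a region-aware method show gains in the foreground without hiding a regression in global geometry.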
Innovation

Methods, ideas, or system contributions that make the work stand out.

Focusable Monocular Depth Estimation
Prompt-conditioned Depth Estimation
Multi-Scale Spatial-Aligned Fusion
Region-aware Depth Modeling
FDE-Bench
Yuxin Du
School of Artificial Intelligence, Shanghai Jiao Tong University
Tao Lin
School of Artificial Intelligence, Shanghai Jiao Tong University
Zile Zhong
School of Artificial Intelligence, Shanghai Jiao Tong University
Runting Li
School of Artificial Intelligence, Shanghai Jiao Tong University
Xiyao Chen
School of Artificial Intelligence, Shanghai Jiao Tong University
Jiting Liu
School of Artificial Intelligence, Shanghai Jiao Tong University
Chenglin Liu
The Hong Kong University of Science and Technology (Guangzhou)
Ying-Cong Chen
The Hong Kong University of Science and Technology (Guangzhou)
Computer Vision and Pattern Recognition
Yuqian Fu
King Abdullah University of Science and Technology
Bo Zhao
Shanghai Jiao Tong University
Embodied AI, MLLM, Data-centric AI