🤖 AI Summary
Severe depth ambiguity critically hinders 3D understanding in transparent and multi-layered scenes, and existing single-depth estimation methods fail to model geometric uncertainty. To mitigate this uncertainty without retraining, this work introduces a multi-hypothesis spatial modeling paradigm. First, we establish MD-3k, the first benchmark dedicated to multi-layer depth estimation. Second, we propose Laplacian Visual Prompting (LVP), a zero-shot method that disentangles implicit multi-layer depth representations from RGB images via a frequency-domain transformation. Third, we design a spectral prompting mechanism and a zero-shot depth fusion strategy, coupled with a novel multi-layer spatial relationship annotation and evaluation framework. Our approach significantly improves geometry-conditioned generation, 3D spatial reasoning, and temporally consistent video-level depth inference, advancing depth estimation toward ambiguity-aware spatial foundation models.
📄 Abstract
Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present MD-3k, a benchmark exposing depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without model retraining. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, unlocking more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at https://github.com/Xiaohao-Xu/Ambiguity-in-Space.
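The core of LVP is feeding a depth model a Laplacian-transformed version of the RGB image instead of the raw pixels. The following is a minimal, illustrative sketch of how such a prompt could be constructed with plain NumPy (a 4-neighbor discrete Laplacian with wrap-around borders, rescaled back to the usual uint8 range); the paper's actual transform, border handling, and normalization may differ, and the function name `laplacian_visual_prompt` is our own.

```python
import numpy as np

def laplacian_visual_prompt(rgb: np.ndarray) -> np.ndarray:
    """Turn an RGB image into a Laplacian visual prompt (illustrative sketch).

    Applies a per-channel 4-neighbor discrete Laplacian, which suppresses
    low-frequency shading and emphasizes high-frequency structure (edges),
    then rescales the result to [0, 255] uint8 so it can be passed to a
    pre-trained depth model in place of the raw RGB input.
    """
    x = rgb.astype(np.float64)
    # Discrete Laplacian: sum of 4 neighbors minus 4x the center pixel.
    # np.roll gives wrap-around borders, which is fine for a sketch.
    lap = (np.roll(x, 1, axis=0) + np.roll(x, -1, axis=0)
           + np.roll(x, 1, axis=1) + np.roll(x, -1, axis=1) - 4.0 * x)
    # Min-max rescale to [0, 255] for model input compatibility.
    lo, hi = lap.min(), lap.max()
    if hi > lo:
        lap = (lap - lo) / (hi - lo) * 255.0
    else:
        lap = np.zeros_like(lap)  # perfectly flat image -> zero response
    return lap.astype(np.uint8)
```

The depth estimate obtained from this prompt would then be combined with the standard RGB-based estimate to form the multi-layer hypothesis; the fusion rule itself is part of the method and is not sketched here.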