🤖 AI Summary
Severe depth ambiguity critically hinders 3D understanding in transparent and multi-layered scenes, and existing single-depth estimation methods fail to model geometric uncertainty. To mitigate this uncertainty without retraining, this work introduces a multi-hypothesis spatial modeling paradigm. First, we establish MD-3k, the first benchmark dedicated to multi-layer depth estimation. Second, we propose Laplacian Visual Prompting (LVP), a zero-shot method that disentangles implicit multi-layer depth representations from RGB images via a frequency-domain transformation. Third, we design a spectral prompting mechanism and a zero-shot depth fusion strategy, coupled with a novel multi-layer spatial relationship annotation and evaluation framework. Our approach significantly improves geometry-conditioned generation, 3D spatial reasoning, and temporally consistent video-level depth inference, advancing depth estimation toward ambiguity-aware spatial foundation models.
📄 Abstract
Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present MD-3k, a benchmark exposing depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without model retraining. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, unlocking more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at https://github.com/Xiaohao-Xu/Ambiguity-in-Space.
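The core of LVP is feeding a depth model a Laplacian-transformed version of the RGB image instead of the raw pixels. The following is a minimal, illustrative sketch of how such a prompt could be constructed with plain NumPy (a 4-neighbor discrete Laplacian with wrap-around borders, rescaled back to the usual uint8 range); the paper's actual transform, border handling, and normalization may differ, and the function name `laplacian_visual_prompt` is our own.

```python
import numpy as np

def laplacian_visual_prompt(rgb: np.ndarray) -> np.ndarray:
    """Turn an RGB image into a Laplacian visual prompt (illustrative sketch).

    Applies a per-channel 4-neighbor discrete Laplacian, which suppresses
    low-frequency shading and emphasizes high-frequency structure (edges),
    then rescales the result to [0, 255] uint8 so it can be passed to a
    pre-trained depth model in place of the raw RGB input.
    """
    x = rgb.astype(np.float64)
    # Discrete Laplacian: sum of 4 neighbors minus 4x the center pixel.
    # np.roll gives wrap-around borders, which is fine for a sketch.
    lap = (np.roll(x, 1, axis=0) + np.roll(x, -1, axis=0)
           + np.roll(x, 1, axis=1) + np.roll(x, -1, axis=1) - 4.0 * x)
    # Min-max rescale to [0, 255] for model input compatibility.
    lo, hi = lap.min(), lap.max()
    if hi > lo:
        lap = (lap - lo) / (hi - lo) * 255.0
    else:
        lap = np.zeros_like(lap)  # perfectly flat image -> zero response
    return lap.astype(np.uint8)
```

The depth estimate obtained from this prompt would then be combined with the standard RGB-based estimate to form the multi-layer hypothesis; the fusion rule itself is part of the method and is not sketched here.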