Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity

πŸ“… 2025-03-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Severe depth ambiguity critically hinders 3D understanding in transparent and multi-layered scenes, and existing single-depth estimation methods fail to model this geometric uncertainty. To resolve depth ambiguity without retraining, this work introduces a multi-hypothesis spatial modeling paradigm. First, it establishes MD-3k, the first benchmark dedicated to multi-layer depth estimation, built on a novel multi-layer spatial relationship annotation and evaluation framework. Second, it proposes Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that elicits implicit multi-layer depth from pre-trained models by feeding them Laplacian-transformed RGB inputs. Third, it fuses LVP-inferred depth with standard RGB-based estimates in a zero-shot depth fusion strategy. The approach significantly improves geometry-conditioned generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference, advancing depth estimation toward ambiguity-aware spatial foundation models.

πŸ“ Abstract
Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present MD-3k, a benchmark exposing depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without model retraining. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, unlocking more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at https://github.com/Xiaohao-Xu/Ambiguity-in-Space.
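
To make the pipeline concrete, below is a minimal sketch of zero-shot multi-layer depth via LVP, assuming a frozen monocular depth estimator exposed as a `depth_model(image) -> depth_map` callable (a hypothetical API stand-in, not the paper's code). The 3x3 Laplacian kernel and the uint8 rescaling are assumptions for illustration; the paper's exact spectral transform and fusion strategy may differ.

```python
import cv2
import numpy as np

def laplacian_visual_prompt(rgb: np.ndarray) -> np.ndarray:
    """Apply a Laplacian (spectral) transform to an RGB image.

    Sketch of the LVP input transform: a per-channel 3x3 Laplacian
    filter, rescaled back to uint8 so a frozen depth model can consume
    it like an ordinary image (normalization choice is an assumption).
    """
    lap = cv2.Laplacian(rgb.astype(np.float32), cv2.CV_32F, ksize=3)
    lap -= lap.min()                     # shift to non-negative range
    lap *= 255.0 / (lap.max() + 1e-8)    # rescale to [0, 255]
    return lap.astype(np.uint8)

def multi_layer_depth(rgb: np.ndarray, depth_model) -> np.ndarray:
    """Zero-shot multi-layer depth: one hypothesis per input view.

    `depth_model` stands in for any frozen monocular depth estimator
    mapping an HxWx3 uint8 image to an HxW depth map (assumed API).
    """
    depth_rgb = depth_model(rgb)                           # standard layer
    depth_lvp = depth_model(laplacian_visual_prompt(rgb))  # LVP layer
    return np.stack([depth_rgb, depth_lvp], axis=0)        # 2 x H x W
```

The key design point is that the same pre-trained model is queried twice, once on the raw RGB image and once on its Laplacian-transformed counterpart, so a second depth hypothesis is elicited without any retraining or fine-tuning.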
Problem

Research questions and friction points this paper is trying to address.

Single-depth estimates fail to capture the full 3D structure of transparent and multi-layered scenes.
Existing models are limited to deterministic predictions and overlook real-world multi-layer depth.
No prior benchmark or metric evaluates multi-layer depth estimation or exposes depth biases in existing models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-hypothesis spatial foundation models
Laplacian Visual Prompting (LVP) technique
Zero-shot multi-layer depth estimation
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Xiaohao Xu
Google; University of Michigan, Ann Arbor
Robust Visual Intelligence · Anomaly Detection · Video & 3D · Computer Vision · Robotics
Feng Xue
University of Michigan, Ann Arbor
Xiang Li
Carnegie Mellon University
Haowei Li
University of Michigan, Ann Arbor
Shusheng Yang
PhD student @ NYU Courant
Computer Vision · Deep Learning · Machine Learning
Tianyi Zhang
Carnegie Mellon University
Matthew Johnson-Roberson
Professor of Robotics, Carnegie Mellon University
Robotics · Field Robotics · Autonomous Vehicles · Marine Robotics
Xiaonan Huang
University of Michigan, Ann Arbor