🤖 AI Summary
Monocular depth foundation models, leveraging strong semantic priors, often hallucinate spurious 3D structure (termed the "3D Mirage") in regions that are geometrically planar yet perceptually ambiguous, such as street paintings, posing an unquantified safety risk. This work is the first to systematically expose, quantify, and mitigate the phenomenon. It introduces the first real-world 3D Mirage hallucination benchmark; proposes a dual-metric, Laplacian-based evaluation framework comprising the Deviation Composite Score (DCS), which measures hallucinated non-planarity, and the Confusion Composite Score (CCS), which quantifies contextual instability; and designs Grounded Self-Distillation, a frozen-teacher/tunable-student framework with plane-aware self-distillation that suppresses hallucinations while preserving semantic knowledge. Experiments demonstrate a 42% reduction in DCS and a 38% reduction in CCS, advancing monocular depth estimation from semantics-driven toward structure-robust evaluation.
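The summary describes DCS only at a high level. As a hedged illustration of the underlying idea, the numpy sketch below scores non-planarity as the mean absolute discrete Laplacian of a depth map inside a region of interest: a truly planar surface has zero Laplacian everywhere, so residual curvature in an annotated planar region signals hallucinated structure. The function name, aggregation, and example data are assumptions for illustration, not the paper's exact metric definition.

```python
import numpy as np

def laplacian_nonplanarity(depth: np.ndarray, roi_mask: np.ndarray) -> float:
    """Illustrative non-planarity score: mean |Laplacian| of depth inside an ROI.

    Hypothetical stand-in for a DCS-style metric; a planar depth surface
    scores ~0, hallucinated bumps raise the score.
    """
    # 5-point discrete Laplacian on the interior of the depth map
    lap = (
        depth[:-2, 1:-1] + depth[2:, 1:-1]
        + depth[1:-1, :-2] + depth[1:-1, 2:]
        - 4.0 * depth[1:-1, 1:-1]
    )
    mask = roi_mask[1:-1, 1:-1].astype(bool)
    return float(np.abs(lap[mask]).mean())

# Toy data: a planar depth ramp vs. the same ramp with a hallucinated bump.
h, w = np.mgrid[0:64, 0:64]
planar = 0.01 * h + 0.02 * w                                   # linear => planar
bumpy = planar + 0.5 * np.exp(-((h - 32) ** 2 + (w - 32) ** 2) / 50.0)
roi = np.ones_like(planar)                                     # whole image as ROI
```

On this toy data the planar ramp scores essentially zero while the bump produces a clearly positive score, which is the contrast such a metric is meant to capture.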
📝 Abstract
Monocular depth foundation models achieve remarkable generalization by learning large-scale semantic priors, but this creates a critical vulnerability: they hallucinate illusory 3D structures from geometrically planar but perceptually ambiguous inputs. We term this failure the 3D Mirage. This paper introduces the first end-to-end framework to probe, quantify, and tame this unquantified safety risk. To probe, we present 3D-Mirage, the first benchmark of real-world illusions (e.g., street art) with precise planar-region annotations and context-restricted crops. To quantify, we propose a Laplacian-based evaluation framework with two metrics: the Deviation Composite Score (DCS) for spurious non-planarity and the Confusion Composite Score (CCS) for contextual instability. To tame this failure, we introduce Grounded Self-Distillation, a parameter-efficient strategy that surgically enforces planarity on illusion ROIs while using a frozen teacher to preserve background knowledge, thus avoiding catastrophic forgetting. Our work provides the essential tools to diagnose and mitigate this phenomenon, urging a necessary shift in monocular depth estimation (MDE) evaluation from pixel-wise accuracy to structural and contextual robustness. Our code and benchmark will be publicly available to foster this research direction.
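The abstract describes Grounded Self-Distillation as enforcing planarity on illusion ROIs while distilling a frozen teacher elsewhere. A minimal numpy sketch of such a two-term objective is given below; the function name, the specific planarity penalty (squared Laplacian), and the weighting `lam` are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def grounded_distillation_loss(student_depth, teacher_depth, roi_mask, lam=1.0):
    """Hypothetical sketch of a grounded self-distillation objective.

    Inside the annotated illusion ROI: penalize curvature (enforce planarity).
    Outside the ROI: match the frozen teacher, preserving background knowledge.
    """
    # Planarity term: squared 5-point Laplacian of the student depth, ROI only
    lap = (
        student_depth[:-2, 1:-1] + student_depth[2:, 1:-1]
        + student_depth[1:-1, :-2] + student_depth[1:-1, 2:]
        - 4.0 * student_depth[1:-1, 1:-1]
    )
    roi = roi_mask[1:-1, 1:-1].astype(bool)
    planarity = float(np.square(lap[roi]).mean()) if roi.any() else 0.0

    # Distillation term: pin the student to the frozen teacher outside the ROI
    bg = ~roi_mask.astype(bool)
    diff = student_depth - teacher_depth
    distill = float(np.square(diff[bg]).mean()) if bg.any() else 0.0

    return planarity + lam * distill

# Toy example: planar teacher, annotated ROI, student hallucinating a bump.
h, w = np.mgrid[0:32, 0:32]
teacher = 0.05 * h + 0.02 * w                    # planar scene
roi = np.zeros((32, 32))
roi[8:24, 8:24] = 1.0                            # annotated illusion region
student = teacher.copy()
student[12:20, 12:20] += 0.3                     # spurious structure in the ROI
```

A student that matches the teacher everywhere incurs near-zero loss, while hallucinated structure inside the ROI is penalized by the planarity term without touching the background, which is the "surgical" property the abstract claims.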