DepthFocus: Controllable Depth Estimation for See-Through Scenes

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current deep depth estimation models struggle with the multi-layer depth ambiguity caused by transparent or reflective objects, producing static single-depth maps without the human-like ability to selectively focus on an intended depth. This work proposes an intention-driven stereo depth estimation framework that reformulates depth estimation as a controllable focusing process: a scalar depth preference serves as conditional input, and a tunable Vision Transformer dynamically modulates its features to support user-specified depth-layer perception. To this end, we introduce the first large-scale synthetic dataset, comprising 500K multi-layer depth samples, augmented with real-world see-through-scene data for end-to-end training. Our method achieves state-of-the-art performance on single-depth benchmarks (e.g., BOOSTER), demonstrates intention-consistent accuracy on a novel multi-depth benchmark, and generalizes well to unseen transparent and reflective scenes.
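The summary describes conditioning a Vision Transformer on a scalar depth preference so the same stereo features can be steered toward different layers. As a rough illustration of how such conditioning can work, here is a minimal PyTorch sketch of FiLM-style feature modulation; the module name, MLP shape, and preference range are assumptions for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch: modulating transformer tokens with a scalar
# depth preference via FiLM-style scale/shift. Names and shapes are
# illustrative assumptions, not the authors' published design.
import torch
import torch.nn as nn

class DepthPreferenceFiLM(nn.Module):
    """Maps a scalar depth preference in [0, 1] to per-channel
    scale/shift parameters that modulate transformer tokens."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden),
            nn.GELU(),
            nn.Linear(hidden, 2 * dim),  # produces gamma and beta
        )

    def forward(self, tokens: torch.Tensor, preference: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D); preference: (B, 1), e.g. 0.0 = nearest layer
        gamma, beta = self.mlp(preference).chunk(2, dim=-1)
        return tokens * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

# The same stereo feature tokens yield differently steered features
# depending on the requested preference.
film = DepthPreferenceFiLM(dim=256)
tokens = torch.randn(2, 1024, 256)        # stereo feature tokens
near = film(tokens, torch.zeros(2, 1))    # focus toward the near surface
far = film(tokens, torch.ones(2, 1))      # focus toward the far, see-through layer
```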

📝 Abstract
Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive, attempting to estimate static depth maps anchored to the nearest surface, while humans actively shift focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as intent-driven control. Conditioned on a scalar depth preference, the model dynamically adapts its computation to focus on the intended depth, enabling selective perception within complex scenes. The training primarily leverages our newly constructed 500k multi-layered synthetic dataset, designed to capture diverse see-through effects. DepthFocus not only achieves state-of-the-art performance on conventional single-depth benchmarks like BOOSTER, a dataset notably rich in transparent and reflective objects, but also quantitatively demonstrates intent-aligned estimation on our newly proposed real and synthetic multi-depth datasets. Moreover, it exhibits strong generalization capabilities on unseen see-through scenes, underscoring its robustness as a significant step toward active and human-like 3D perception.
Problem

Research questions and friction points this paper is trying to address.

Solving layered depth ambiguities in transmissive scenes
Enabling intent-driven depth control for selective perception
Addressing passive depth estimation limitations in see-through scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Steerable Vision Transformer for depth estimation
Intent-driven control with scalar depth preference
Dynamic computation focusing on the intended depth (see the sketch after this list)
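To make "selective perception" concrete, the sketch below sweeps the scalar preference over a single stereo pair to obtain one depth map per intended layer (e.g., a glass surface versus the scene behind it). `DepthFocusModel` and its call signature are hypothetical stand-ins; the page describes no public API.

```python
# Hypothetical usage sketch: one stereo pair, multiple intended layers.
# DepthFocusModel is a dummy stand-in for an intent-conditioned network.
import torch

class DepthFocusModel(torch.nn.Module):
    """Stand-in for an intent-conditioned stereo depth network."""
    def forward(self, left, right, preference):
        # A real model would return a depth map steered toward the
        # layer indicated by `preference`; here we return a dummy map.
        return torch.zeros(left.shape[0], 1, *left.shape[-2:])

model = DepthFocusModel()
left = torch.randn(1, 3, 384, 640)
right = torch.randn(1, 3, 384, 640)

# One depth map per intended layer: near surface, mid, far layer.
layers = [model(left, right, torch.tensor([[p]])) for p in (0.0, 0.5, 1.0)]
```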
Junhong Min
Samsung Electronics
Jimin Kim
Cheol-Hui Min
Samsung Electronics
Minwook Kim
Samsung Electronics
Youngpil Jeon
Samsung Electronics
Minyong Choi
Samsung Electronics