🤖 AI Summary
Estimating multi-layer depth for transparent objects, that is, perceiving both the transparent surface and the objects behind it, is crucial for real-world applications that interact with transparent materials, yet existing depth estimation methods struggle with such objects. To address this, we introduce LayeredDepth, the first dataset with multi-layer depth annotations, comprising a real-world benchmark (1,500 images) and a procedurally generated synthetic dataset (15,300 images). A fully procedural generator renders synthetic images with ground-truth multi-layer depth, which are used both to train baseline models and to fine-tune state-of-the-art single-layer depth models. Fine-tuning raises quadruplet accuracy on the real-world benchmark from 55.14% to 75.20%, a gain of 20.06 percentage points. Baseline models trained solely on the synthetic data also generalize well across domains for multi-layer depth estimation. This work contributes a new benchmark, a synthetic data generator, and an effective way to improve depth perception in scenes with transparent objects.
📝 Abstract
Transparent objects are common in daily life, and understanding their multi-layer depth information -- perceiving both the transparent surface and the objects behind it -- is crucial for real-world applications that interact with transparent materials. In this paper, we introduce LayeredDepth, the first dataset with multi-layer depth annotations, including a real-world benchmark and a synthetic data generator, to support the task of multi-layer depth estimation. Our real-world benchmark consists of 1,500 images from diverse scenes, and evaluating state-of-the-art depth estimation methods on it reveals that they struggle with transparent objects. The synthetic data generator is fully procedural and capable of providing training data for this task with an unlimited variety of objects and scene compositions. Using this generator, we create a synthetic dataset with 15,300 images. Baseline models trained solely on this synthetic dataset achieve good cross-domain multi-layer depth estimation. Fine-tuning state-of-the-art single-layer depth models on it substantially improves their performance on transparent objects, with quadruplet accuracy on our benchmark increasing from 55.14% to 75.20%. All images and validation annotations are available under CC0 at https://layereddepth.cs.princeton.edu.