Zero-Shot Depth from Defocus

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited zero-shot generalization capability of focus-stack-based depth estimation in unseen scenes by proposing the FOSSA network architecture and the real-world ZEDD benchmark. Built upon a Transformer backbone, the method introduces a stack attention mechanism tailored for depth-from-defocus (DfD) and incorporates focal distance embeddings. It also establishes a novel training paradigm that synthesizes focus stacks from generic RGB-D data. Key contributions include the first stack attention layer designed for DfD, the large-scale ZEDD benchmark, and an effective cross-domain transfer strategy. Experimental results demonstrate that the proposed approach significantly outperforms existing methods on ZEDD and other benchmarks, achieving up to a 55.7% reduction in depth estimation error.
📝 Abstract
Depth from Defocus (DfD) is the task of estimating a dense metric depth map from a focus stack. Unlike previous works overfitting to a certain dataset, this paper focuses on the challenging and practical setting of zero-shot generalization. We first propose a new real-world DfD benchmark ZEDD, which contains 8.3x more scenes and significantly higher quality images and ground-truth depth maps compared to previous benchmarks. We also design a novel network architecture named FOSSA. FOSSA is a Transformer-based architecture with novel designs tailored to the DfD task. The key contribution is a stack attention layer with a focus distance embedding, allowing efficient information exchange across the focus stack. Finally, we develop a new training data pipeline allowing us to utilize existing large-scale RGBD datasets to generate synthetic focus stacks. Experimental results on ZEDD and other benchmarks show a significant improvement over the baselines, reducing errors by up to 55.7%. The ZEDD benchmark is released at https://zedd.cs.princeton.edu. The code and checkpoints are released at https://github.com/princeton-vl/FOSSA.
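The abstract describes the key architectural idea: a stack attention layer that exchanges information across the slices of a focus stack, with each slice tagged by a focus distance embedding. The paper's actual FOSSA implementation is not reproduced here; the sketch below is only an illustration of how such a layer might work, using single-head attention over the stack axis at each spatial location and a sinusoidal embedding of focus distance. All function names, the embedding form, and the single-head formulation are assumptions, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def focal_distance_embedding(focus_dists, dim):
    """Sinusoidal embedding of each slice's focus distance.

    The sinusoidal form is an assumption borrowed from standard
    Transformer positional encodings, not the paper's design.
    """
    freqs = np.exp(-np.arange(0, dim, 2) / dim * np.log(1e4))
    ang = np.asarray(focus_dists, dtype=float)[:, None] * freqs[None, :]
    emb = np.zeros((len(focus_dists), dim))
    emb[:, 0::2] = np.sin(ang)
    emb[:, 1::2] = np.cos(ang)
    return emb

def stack_attention(feats, focus_dists):
    """Self-attention across the stack axis, independently per pixel.

    feats:       (S, H, W, C) features for S focus-stack slices.
    focus_dists: (S,) focus distance of each slice, in meters.

    Attending over S slices per pixel (rather than over all H*W
    tokens) keeps the cost linear in image size.
    """
    S, H, W, C = feats.shape
    # Tag each slice with its focus distance before attending.
    x = feats + focal_distance_embedding(focus_dists, C)[:, None, None, :]
    # Treat each pixel as a batch element with a sequence of S slices.
    q = k = v = x.reshape(S, H * W, C).transpose(1, 0, 2)        # (HW, S, C)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C), axis=-1)  # (HW, S, S)
    out = attn @ v                                                # (HW, S, C)
    return out.transpose(1, 0, 2).reshape(S, H, W, C)
```

Because attention runs only along the stack dimension, each pixel compares its appearance across focus settings, which is exactly the cue DfD exploits: the slice in which a pixel is sharpest reveals its depth.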
Problem

Research questions and friction points this paper is trying to address.

Depth from Defocus
Zero-Shot Generalization
Focus Stack
Metric Depth Estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Generalization
Depth from Defocus
Transformer Architecture
Stack Attention
Synthetic Focus Stack
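One of the listed contributions is a training pipeline that synthesizes focus stacks from generic RGB-D data. The paper's rendering pipeline is not shown on this page; the following is a minimal, assumed sketch of the general technique: compute a thin-lens circle-of-confusion (CoC) radius per pixel for a chosen focus distance, then apply depth-layered blur. The lens constants, layered box-blur approximation, and lack of occlusion handling are all simplifications, not the authors' method.

```python
import numpy as np

def coc_radius(depth, focus_dist, aperture=0.05, focal_len=0.05, px_per_m=1000):
    """Thin-lens circle-of-confusion radius in pixels.

    aperture, focal_len (meters) and px_per_m are illustrative constants,
    not values from the paper.
    """
    c = aperture * focal_len * np.abs(depth - focus_dist) \
        / (depth * (focus_dist - focal_len))
    return c * px_per_m

def box_blur(img, r):
    """Separable box blur of integer radius r via cumulative sums."""
    if r <= 0:
        return img
    k = 2 * r + 1
    pad = np.pad(img, ((r, r), (r, r), (0, 0)), mode="edge")
    c = pad.cumsum(0)
    pad = (c[k - 1:] - np.concatenate(
        [np.zeros((1,) + c.shape[1:]), c[:-k]], 0)) / k
    c = pad.cumsum(1)
    return (c[:, k - 1:] - np.concatenate(
        [np.zeros((c.shape[0], 1, c.shape[2])), c[:, :-k]], 1)) / k

def render_slice(rgb, depth, focus_dist, n_layers=8):
    """Render one synthetic focus-stack slice from an RGB-D pair.

    Depth is quantized into bands; each band is blurred by the CoC of
    its mean depth. Occlusion boundaries are ignored for simplicity.
    """
    out = np.zeros_like(rgb, dtype=float)
    edges = np.linspace(depth.min(), depth.max() + 1e-6, n_layers + 1)
    for i in range(n_layers):
        mask = (depth >= edges[i]) & (depth < edges[i + 1])
        if not mask.any():
            continue
        r = int(round(coc_radius(depth[mask].mean(), focus_dist)))
        out[mask] = box_blur(rgb.astype(float), r)[mask]
    return out
```

Sweeping `focus_dist` over several values and calling `render_slice` once per value yields a synthetic focus stack with known ground-truth depth, which is what lets large-scale RGB-D datasets stand in for real captured stacks during training.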
Yiming Zuo
Princeton University
Computer Vision
Hongyu Wen
PhD student, Princeton University
Venkat Subramanian
Department of Computer Science, Princeton University
Patrick Chen
Department of Computer Science, Princeton University
Karhan Kayan
Department of Computer Science, Princeton University
Mario Bijelic
Princeton University
Autonomous Driving, Computer Vision, Computational Imaging, Machine Learning
Felix Heide
Princeton University | Torc Robotics
Jia Deng
Princeton University
Computer Vision