🤖 AI Summary
This paper introduces the novel task of *unobserved object detection*: detecting and localizing objects that are occluded or lie outside the field of view yet are physically nearby, given a single image. Methodologically, it adapts several state-of-the-art pre-trained generative models, including 2D and 3D diffusion models and vision-language models (VLMs), to infer the presence and location of objects that are not directly observed. Key contributions include: (1) a formal definition of unobserved object detection in the 2D, 2.5D, and 3D settings; (2) an evaluation protocol with a suite of metrics covering different aspects of performance, such as localization accuracy, geometric plausibility, and semantic consistency; and (3) an empirical study on indoor scenes from RealEstate10k and NYU Depth v2 indicating that pre-trained generative models carry useful implicit scene priors for this task. The work positions generative priors as a basis for spatial reasoning about unseen parts of a scene.
📝 Abstract
Can objects that are not visible in an image -- but are in the vicinity of the camera -- be detected? This study introduces the novel tasks of 2D, 2.5D and 3D unobserved object detection for predicting the location of nearby objects that are occluded or lie outside the image frame. We adapt several state-of-the-art pre-trained generative models to address this task, including 2D and 3D diffusion models and vision-language models, and show that they can be used to infer the presence of objects that are not directly observed. To benchmark this task, we propose a suite of metrics that capture different aspects of performance. Our empirical evaluation on indoor scenes from the RealEstate10k and NYU Depth v2 datasets demonstrates results that motivate the use of generative models for the unobserved object detection task.
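To make the 3D variant of the task concrete, below is a minimal sketch of how predicted locations of unobserved objects could be scored against ground truth. The function name, the distance threshold, and the camera-frame convention are illustrative assumptions for exposition only; they are not the paper's actual metric suite.

```python
import numpy as np

def evaluate_unobserved_detections(pred, gt, dist_thresh=0.5):
    """Toy scoring for 3D unobserved object detection.

    pred, gt: dicts mapping a class name to an (N, 3) array of
    camera-frame 3D positions (metres) for objects that are occluded
    or outside the image frame. Returns per-class recall at
    `dist_thresh` and the mean localization error over matched objects.
    """
    recalls, errors = {}, []
    for cls, gt_pts in gt.items():
        pred_pts = pred.get(cls, np.empty((0, 3)))
        matched = 0
        for g in gt_pts:
            if len(pred_pts) == 0:
                continue
            # Distance from this ground-truth object to every prediction.
            d = np.linalg.norm(pred_pts - g, axis=1)
            if d.min() <= dist_thresh:
                matched += 1
                errors.append(d.min())
        recalls[cls] = matched / max(len(gt_pts), 1)
    mean_err = float(np.mean(errors)) if errors else float("nan")
    return recalls, mean_err

# Example: a chair behind the camera (negative z), predicted ~25 cm off.
gt = {"chair": np.array([[0.2, 0.0, -1.5]])}
pred = {"chair": np.array([[0.35, 0.1, -1.3]])}
print(evaluate_unobserved_detections(pred, gt))
```

A distance-thresholded match of this kind captures only localization accuracy; the paper's benchmark also considers other aspects of performance, such as whether predictions are geometrically plausible and semantically consistent with the visible scene.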