SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty that vision-language models (VLMs) have in reasoning about the locations of occluded or invisible functional objects in real-world scenes. The authors introduce SceneFunRI, a benchmark of 855 2D spatial reasoning instances derived from the SceneFun3D dataset via a semi-automatic pipeline. The benchmark requires models to localize unseen functional objects by combining task instructions with commonsense reasoning. The prompting analysis is grouped into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). The strongest baseline, Gemini 3 Flash, achieves only a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65, which highlights both the difficulty of the task and the limitations of existing models.

📝 Abstract

In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) achieves only a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.
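
The abstract reports mIoU and Dist without defining them here. As a rough illustration only, the sketch below assumes predictions and ground truths are axis-aligned 2D boxes given as (x1, y1, x2, y2), that mIoU is the mean of per-instance intersection-over-union, and that Dist is a Euclidean distance between box centers. These are common conventions, not confirmed definitions from the paper; the function names are hypothetical, and CAcc@75 is left out because its definition is not stated.

```python
from math import hypot

# Assumption: boxes are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap width/height, clamped at zero when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def center_distance(a, b):
    """Euclidean distance between the centers of two boxes."""
    acx, acy = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bcx, bcy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return hypot(acx - bcx, acy - bcy)

def mean_iou(preds, gts):
    """Mean IoU over paired predictions and ground truths."""
    scores = [box_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, a prediction that lands near the target but does not overlap it scores zero IoU while still having a small center distance, which is one reason the two measures can be reported side by side.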

Problem

Research questions and friction points this paper is trying to address.

invisible object localization

vision-language models

spatial reasoning

commonsense reasoning

task-driven

Innovation

Methods, ideas, or system contributions that make the work stand out.

invisible object reasoning

vision-language models

spatial reasoning