🤖 AI Summary
This work addresses the challenge that current vision-language models struggle with multi-step spatial logical reasoning in complex real-world scenarios. To this end, it introduces SpatiaLQA, a benchmark comprising 9,605 question-answer pairs across 241 real-world indoor scenes, specifically designed to evaluate such reasoning capabilities. The authors propose a recursive scene graph–assisted reasoning method that progressively decomposes intricate spatial relationships by integrating visual foundation models with task-driven structured graph representations. Extensive experiments across 41 mainstream vision-language models show that even the most advanced models struggle with spatial logical reasoning, and that the proposed approach substantially outperforms existing methods.
📝 Abstract
Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which requires not only understanding the spatial relationships among objects in complex scenes, but also grasping the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question-answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph-assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs. This enhances the spatial logical reasoning ability of VLMs and outperforms all previous methods. Code and dataset are available at https://github.com/xieyc99/SpatiaLQA.
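To make the core idea of recursive scene graph–assisted reasoning concrete, here is a minimal sketch of how a full scene graph might be recursively pruned to a task-relevant subgraph: starting from the objects a question mentions, each recursion step expands the kept set by one spatial-relation hop. All names (`SceneGraph`, `task_relevant_subgraph`, the example objects and relations) are illustrative assumptions, not the authors' implementation, which additionally relies on visual foundation models to build the graph from the image.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """A toy scene graph: objects as nodes, spatial relations as edges.

    Relations are (subject, predicate, object) triples,
    e.g. ("cup", "on", "table"). Hypothetical structure for illustration.
    """
    objects: set
    relations: list = field(default_factory=list)

    def neighbours(self, obj):
        """Objects directly connected to `obj` by any spatial relation."""
        out = set()
        for s, _, o in self.relations:
            if s == obj:
                out.add(o)
            elif o == obj:
                out.add(s)
        return out

def task_relevant_subgraph(graph, seeds, depth):
    """Recursively grow the kept-object set from the question-mentioned
    seeds, one spatial hop per recursion level, then keep only relations
    whose endpoints both survive the pruning."""
    if depth == 0:
        keep = set(seeds)
    else:
        expanded = set(seeds)
        for obj in seeds:
            expanded |= graph.neighbours(obj)
        keep = task_relevant_subgraph(graph, expanded, depth - 1).objects
    relations = [(s, p, o) for s, p, o in graph.relations
                 if s in keep and o in keep]
    return SceneGraph(objects=keep, relations=relations)

# Toy indoor scene: a question about the cup only needs the cup's
# immediate spatial context, not the whole room.
scene = SceneGraph(
    objects={"cup", "table", "sofa", "lamp", "window"},
    relations=[("cup", "on", "table"), ("table", "next_to", "sofa"),
               ("lamp", "behind", "sofa"), ("window", "above", "sofa")],
)
sub = task_relevant_subgraph(scene, seeds={"cup"}, depth=1)
print(sorted(sub.objects))   # → ['cup', 'table']
print(sub.relations)         # → [('cup', 'on', 'table')]
```

Under this toy formulation, increasing `depth` admits progressively more of the scene, mirroring how a multi-step task may need spatial context beyond the objects named in the question.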