SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models

📅 2026-02-24
🤖 AI Summary
This work addresses the challenge that current vision-language models struggle with multi-step spatial logical reasoning in complex real-world scenarios. To this end, it introduces SpatiaLQA, the first systematically defined benchmark for this capability, comprising 9,605 question-answer pairs across 241 indoor scenes. The authors propose a recursive scene graph–assisted reasoning method that progressively decomposes intricate spatial relationships by integrating foundation vision models with task-driven structured graph representations. Extensive experiments across 41 state-of-the-art vision-language models show that even the most advanced models struggle with spatial logical reasoning, and that the proposed approach substantially outperforms existing methods.

📝 Abstract
Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which requires not only understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question-answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph–assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs and outperforming all previous methods. Code and dataset are available at https://github.com/xieyc99/SpatiaLQA.
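The paper does not spell out its decomposition procedure here, but the core idea of recursively extracting a task-relevant scene graph can be illustrated with a minimal sketch. Everything below is hypothetical: the `scene` dictionary stands in for relations a visual foundation model might emit, and `build_task_graph` is an illustrative helper name, not the authors' implementation.

```python
def build_task_graph(scene, seed, max_depth=3):
    """Recursively expand a task-relevant scene graph from seed objects.

    scene: {object: {relation: object}}, e.g. as produced by a detector
           plus a spatial-relation predictor (hypothetical format).
    seed:  objects mentioned in the question.
    Returns the subgraph of `scene` reachable from the seed objects,
    so the VLM only has to reason over question-relevant relations.
    """
    graph = {}

    def expand(obj, depth):
        # Stop on depth limit, already-visited objects, or unknown objects.
        if depth > max_depth or obj in graph or obj not in scene:
            return
        graph[obj] = dict(scene[obj])
        # Recurse into related objects so multi-step spatial
        # dependencies (e.g. cup -> table -> sofa) stay resolvable.
        for target in scene[obj].values():
            expand(target, depth + 1)

    for s in seed:
        expand(s, 0)
    return graph


# Toy indoor scene: the lamp/tv pair is irrelevant to a question about the cup.
scene = {
    "cup": {"on": "table"},
    "table": {"left_of": "sofa"},
    "sofa": {},
    "lamp": {"near": "tv"},
}
print(build_task_graph(scene, {"cup"}))
```

Under these assumptions the call keeps `cup`, `table`, and `sofa` while pruning the unrelated `lamp`/`tv` relation, mirroring how a task-driven graph shrinks the context a VLM must reason over.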
Problem

Research questions and friction points this paper is trying to address.

spatial logical reasoning
vision-language models
visual question answering
scene understanding
multi-step reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial logical reasoning
vision-language models
scene graph
recursive reasoning
visual question answering
👥 Authors
Yuechen Xie, Zhejiang University
Xiaoyan Zhang, Zhejiang University
Yicheng Shan, The University of Sydney
Hao Zhu, ManyCore
Rui Tang, ManyCore
Rong Wei, ManyCore
Mingli Song, Zhejiang University
Yuanyu Wan, Zhejiang University
Jie Song, Zhejiang University