DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

📅 2025-11-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates the compositional generalization capabilities of large language models (LLMs) on multi-hop spatial reasoning tasks along four dimensions: productivity (reasoning depth), substitutivity (robustness to entity and linguistic variation), overgeneralization (sensitivity to input ordering and distractors), and systematicity (transfer to novel linguistic elements). To this end, the authors introduce DecompSR, a large (over 5 million datapoints), procedurally generated spatial reasoning benchmark whose labels are correct by construction and independently verified with a symbolic solver, and whose generation framework lets each compositional factor be varied independently. Experiments show that current LLMs struggle with productive (deeper multi-step) and systematic generalization, yet are comparatively robust to linguistic variation. The benchmark provides a verifiable, fine-grained framework for assessing compositional generalization in LLMs.

📝 Abstract

We introduce DecompSR (decomposed spatial reasoning), a large benchmark dataset (over 5 million datapoints) and generation framework designed to analyse compositional spatial reasoning ability. The generation framework allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). DecompSR is built procedurally so that it is correct by construction, and its correctness is independently verified using a symbolic solver. We benchmark DecompSR across a range of Large Language Models (LLMs) and show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks, whereas they are more robust to linguistic variation. DecompSR provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.
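
To make the idea of "correct by construction, then independently verified" concrete, here is a minimal toy sketch in Python. It is not the DecompSR generator or its solver: the four-relation vocabulary, the entity names, and the functions `generate_story` and `solve` are all hypothetical, chosen only to illustrate how reasoning depth, distractors, and input order could be varied independently while a separate symbolic check confirms each label.

```python
import random

# Hypothetical relation vocabulary (an assumption, not the dataset's own).
# "A is <rel> B" means position(A) = position(B) + DIRECTIONS[rel].
DIRECTIONS = {
    "left of": (-1, 0),
    "right of": (1, 0),
    "above": (0, 1),
    "below": (0, -1),
}


def generate_story(depth, n_distractors=0, shuffle=False, seed=0):
    """Build a chain of `depth` relations and return (facts, query, label).

    The label is tracked while the chain is built, so it is correct by
    construction. Each knob (depth, distractors, ordering) is independent.
    """
    rng = random.Random(seed)
    names = [f"E{i}" for i in range(depth + 1)]
    positions = {names[0]: (0, 0)}
    facts = []
    for prev, cur in zip(names, names[1:]):
        rel = rng.choice(list(DIRECTIONS))
        dx, dy = DIRECTIONS[rel]
        px, py = positions[prev]
        positions[cur] = (px + dx, py + dy)
        facts.append((cur, rel, prev))
    # Distractors hang off the chain but never lie on the query path.
    for j in range(n_distractors):
        anchor = rng.choice(names)
        facts.append((f"D{j}", rng.choice(list(DIRECTIONS)), anchor))
    if shuffle:
        rng.shuffle(facts)
    query = (names[-1], names[0])  # "Where is the last entity relative to the first?"
    return facts, query, positions[names[-1]]


def solve(facts, query):
    """Independent symbolic check: propagate offsets through the fact graph."""
    head, tail = query
    adjacency = {}
    for a, rel, b in facts:
        dx, dy = DIRECTIONS[rel]
        adjacency.setdefault(b, []).append((a, dx, dy))
        adjacency.setdefault(a, []).append((b, -dx, -dy))
    offsets = {tail: (0, 0)}
    stack = [tail]
    while stack:
        node = stack.pop()
        x, y = offsets[node]
        for neighbour, dx, dy in adjacency.get(node, []):
            if neighbour not in offsets:
                offsets[neighbour] = (x + dx, y + dy)
                stack.append(neighbour)
    return offsets[head]


if __name__ == "__main__":
    facts, query, label = generate_story(depth=4, n_distractors=2, shuffle=True, seed=1)
    for a, rel, b in facts:
        print(f"{a} is {rel} {b}.")
    print("query:", query, "label:", label, "verified:", solve(facts, query) == label)
```

In this sketch the solver never sees the generator's internal positions; it recovers the answer from the facts alone, which is what makes the check independent. Because the solver traverses a graph and does not read facts in order, shuffling and distractors leave its answer unchanged, whereas those same perturbations are exactly what the benchmark uses to probe LLMs.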

Problem

Research questions and friction points this paper is trying to address.

Analyzes compositional spatial reasoning in large language models

Evaluates generalization across productivity and systematicity dimensions

Provides provably correct benchmark for fine-grained reasoning assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Procedurally generated benchmark for spatial reasoning analysis

Symbolic solver verification ensures dataset construction correctness

Independent variation of compositionality aspects enables fine-grained probing
