🤖 AI Summary
This work addresses error accumulation across the repeated reference-frame transformations that multi-hop relative spatial reasoning requires, which causes semantic and metric drift in generated 3D layouts. The authors propose the R³L framework, which has three components: invariant spatial decomposition, which breaks coupled relation chains; consistent spatial imagination, an imagine-and-revise loop that promotes self-consistency; and supportive spatial optimization, which eases pose optimization through global-to-local coordinate re-parameterization. Building on relations inferred by multimodal large language models, R³L produces more physically feasible and semantically consistent 3D layouts across diverse scene types and instructions, and the authors' analysis indicates that resolving frame-induced inconsistencies is crucial for reliable multi-hop relative spatial reasoning.
📝 Abstract
Relative spatial relations provide a compact representation of spatial structure and are fundamental to relative spatial reasoning in 3D layout generation. Recent works leverage Multimodal Large Language Models (MLLMs) to infer such relations, but the inferred relations are often unreliable and are typically handled with post-hoc heuristics. In this paper, we propose R$^3$L, a general framework that improves the reliability and consistency of relative spatial reasoning for 3D layout generation. Our key motivation is that multi-hop reasoning requires repeated reference-frame transformations, which accumulate errors in inferred relations and lead to semantic and metric drift. To mitigate this, we propose invariant spatial decomposition to break coupled relation chains, and consistent spatial imagination to promote self-consistency through an imagine-and-revise loop. We further introduce supportive spatial optimization to ease pose optimization via global-to-local coordinate re-parameterization. Extensive experiments across diverse scene types and instructions demonstrate that R$^3$L produces more physically feasible and semantically consistent layouts. Notably, our analysis shows that resolving frame-induced inconsistencies is crucial for reliable multi-hop relative spatial reasoning. The code is available at https://github.com/Neal2020GitHub/R3L.
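To make the motivation concrete, the following is a minimal illustrative sketch and not the R$^3$L implementation. It assumes simplified 2D poses `(x, y, yaw)` and a chain of objects in which each one is placed relative to the previous one. The helpers `compose`, `to_local`, `chain_end`, and `position_drift` are hypothetical names introduced here. The sketch shows how a small, constant heading error per hop compounds through repeated reference-frame transformations, and how a world pose can be re-expressed in a parent's local frame, which is the basic operation behind a global-to-local re-parameterization.

```python
import math

# Hypothetical 2D pose: (x, y, yaw in radians). Not taken from the paper.

def compose(parent, rel):
    """Map a pose given in `parent`'s local frame into the world frame."""
    px, py, pyaw = parent
    rx, ry, ryaw = rel
    c, s = math.cos(pyaw), math.sin(pyaw)
    return (px + c * rx - s * ry, py + s * rx + c * ry, pyaw + ryaw)

def to_local(parent, world):
    """Inverse of `compose`: express a world pose in `parent`'s local frame."""
    px, py, pyaw = parent
    wx, wy, wyaw = world
    dx, dy = wx - px, wy - py
    c, s = math.cos(pyaw), math.sin(pyaw)
    return (c * dx + s * dy, -s * dx + c * dy, wyaw - pyaw)

def chain_end(rels, yaw_error=0.0):
    """Place objects hop by hop; each hop's heading is off by `yaw_error`."""
    pose = (0.0, 0.0, 0.0)
    for rx, ry, ryaw in rels:
        pose = compose(pose, (rx, ry, ryaw + yaw_error))
    return pose

def position_drift(hops, yaw_error):
    """Distance between the exact and the perturbed end of a `hops`-long chain."""
    rels = [(1.0, 0.0, 0.0)] * hops  # "one unit in front of the previous object"
    exact = chain_end(rels)
    noisy = chain_end(rels, yaw_error)
    return math.hypot(noisy[0] - exact[0], noisy[1] - exact[1])

if __name__ == "__main__":
    err = math.radians(5.0)  # assumed per-hop heading error
    for hops in (1, 2, 4, 8):
        print(hops, round(position_drift(hops, err), 4))
```

In this toy setting the positional drift is zero for a single hop and grows with every additional hop, because each heading error rotates all later placements. Expressing a pose relative to its immediate parent with `to_local` keeps the quantity being adjusted independent of errors further up the chain, which is one plausible reading of why a local parameterization can ease pose optimization.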