AI Summary
Vision-language models (VLMs) suffer from a fundamental semantic-geometric misalignment in spatial reasoning, leading to unverifiable inference and unconstrained planning. To address this, we propose the Geometrically-Constrained Agent (GCA), a training-free paradigm that explicitly enforces a formal task constraint throughout the entire reasoning process. GCA strictly decouples semantic parsing (performed by the VLM) from geometric solving (executed by deterministic, domain-specific tools), thereby eliminating reliance on the unrealistic "oracle" assumptions prevalent in prior work. It establishes a verifiable constraint framework that keeps end-to-end reasoning within rigorous geometric bounds. Evaluated across multiple spatial reasoning benchmarks, GCA achieves state-of-the-art performance, with an average improvement of about 27% over prior methods, while improving accuracy, robustness, and formal verifiability.
Abstract
Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference, but their reasoning operates within a lossy semantic space that is misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose the Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouple the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into a formal, verifiable task constraint that defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained reasoning strategy successfully resolves the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves state-of-the-art (SOTA) performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%. Please see our homepage at https://gca-spatial-reasoning.github.io.
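The two-stage split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `TaskConstraint`, `ToolCall`, `violations`, the `allowed_tools` field, and the example tool names are all hypothetical, chosen only to show how a plan of tool calls might be checked against a constraint that fixes a reference frame and an objective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConstraint:
    """Hypothetical stage-1 output: the formal constraint the analyst produces."""
    reference_frame: str      # e.g. "camera" or "object:chair" (assumed encoding)
    objective: str            # e.g. "relative_direction" (assumed label)
    allowed_tools: frozenset  # tools the solver stage may call (assumption)


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical stage-2 step: one tool invocation proposed by the solver."""
    tool: str
    frame: str


def violations(constraint, calls):
    """Return a list of reasons why a plan falls outside the constraint."""
    problems = []
    for i, call in enumerate(calls):
        if call.tool not in constraint.allowed_tools:
            problems.append(f"step {i}: tool '{call.tool}' is not permitted")
        if call.frame != constraint.reference_frame:
            problems.append(
                f"step {i}: frame '{call.frame}' differs from "
                f"'{constraint.reference_frame}'"
            )
    return problems


constraint = TaskConstraint(
    reference_frame="object:chair",
    objective="relative_direction",
    allowed_tools=frozenset({"estimate_pose", "compute_direction"}),
)
good_plan = [
    ToolCall("estimate_pose", "object:chair"),
    ToolCall("compute_direction", "object:chair"),
]
bad_plan = [ToolCall("compute_direction", "camera")]

print(violations(constraint, good_plan))
print(violations(constraint, bad_plan))
```

In this sketch the plan is rejected before any geometric tool runs, which mirrors the idea of constraining planning rather than only the final computation.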