🤖 AI Summary
Current vision-language models exhibit fragility in allocentric spatial reasoning tasks, struggling to disentangle allocentric spatial relationships from egocentric visual priors. This work proposes a training-free strategy that leverages off-the-shelf 3D geometric reconstruction tools to recover metric 3D scene states from single or multiple images and instantiates an aligned allocentric reference frame guided by instruction semantics, thereby transforming implicit "mental rotation" into explicit geometric computation. This approach achieves, for the first time, disentanglement of allocentric reasoning from egocentric visual priors without any fine-tuning, significantly enhancing spatial generalization. It yields consistent performance gains of approximately 10% across multiple spatial reasoning benchmarks while preserving strong performance on egocentric tasks, outperforming both specialized fine-tuned models and current state-of-the-art open- and closed-source systems.
📝 Abstract
With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing attention. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. To this end, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
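To make the "explicit computation" step concrete, the sketch below shows how an allocentric relation query reduces to a deterministic rigid transform once metric geometry is available. This is a minimal illustration, not the authors' implementation: the frame convention (x-forward, y-left, z-up) and the object names and poses are hypothetical stand-ins for reconstructed scene states.

```python
# Illustrative sketch only: answering "is the mug to the chair's left?" by
# transforming reconstructed world-frame geometry into a target-centric
# (allocentric) frame. All poses below are hypothetical placeholders.
import numpy as np

def allocentric_frame(target_pos, target_fwd, world_up=np.array([0.0, 0.0, 1.0])):
    """Build a world-to-target rotation: +x forward, +y left, +z up."""
    f = target_fwd / np.linalg.norm(target_fwd)
    left = np.cross(world_up, f)
    left /= np.linalg.norm(left)
    up = np.cross(f, left)               # re-orthogonalized up axis
    R = np.stack([f, left, up])          # rows are the target frame's axes
    return R, target_pos

def to_allocentric(points, R, origin):
    """Express world-frame points of shape (N, 3) in the target-centric frame."""
    return (np.asarray(points) - origin) @ R.T

# Hypothetical reconstructed poses: a chair at (1, 2, 0) facing -y, and a mug.
R, origin = allocentric_frame(np.array([1.0, 2.0, 0.0]), np.array([0.0, -1.0, 0.0]))
fwd, lat, _ = to_allocentric([[0.2, 1.5, 0.8]], R, origin)[0]
print(f"mug: {'in front of' if fwd > 0 else 'behind'} the chair, "
      f"on its {'left' if lat > 0 else 'right'} ({abs(lat):.2f} m laterally)")
```

The frame-relative coordinates (or a relation string derived from them) can then be serialized into the prompt, so the backbone VLM reads the answer off the structured representation instead of performing the rotation implicitly.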