Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation

📅 2026-02-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current vision-language models exhibit fragility in allocentric spatial reasoning tasks, struggling to disentangle allocentric spatial relationships from egocentric visual priors. This work proposes a training-free strategy that leverages off-the-shelf 3D geometric reconstruction tools to recover metric 3D scene states from single or multiple images and instantiates an aligned allocentric reference frame guided by instruction semantics, thereby transforming implicit "mental rotation" into explicit geometric computation. This approach achieves, for the first time, disentanglement of allocentric reasoning from egocentric visual priors without any fine-tuning, significantly enhancing spatial generalization. It yields consistent performance gains of approximately 10% across multiple spatial reasoning benchmarks while preserving strong performance on egocentric tasks, outperforming both specialized fine-tuned models and current state-of-the-art open- and closed-source systems.
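
The "explicit geometric computation" the summary describes can be pictured with a short sketch: given reconstructed object positions in camera coordinates and a target object's facing direction, build the target-centric frame and read spatial relations off signed coordinates. This is a minimal illustration in Python/NumPy; the function names, the chair/cup toy scene, and the axis convention (x=right, y=forward, z=up) are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def allocentric_frame(origin, forward, up=(0.0, 0.0, 1.0)):
    """Hypothetical helper: rotation matrix whose rows are the target's
    right/forward/up axes, expressed in camera coordinates."""
    f = np.asarray(forward, float)
    f /= np.linalg.norm(f)
    r = np.cross(f, up)                  # right-hand axis
    r /= np.linalg.norm(r)
    u = np.cross(r, f)                   # re-orthogonalized up
    return np.stack([r, f, u]), np.asarray(origin, float)

def to_allocentric(point_cam, R, origin):
    """Express a camera-frame point in the target-centric frame."""
    return R @ (np.asarray(point_cam, float) - origin)

def relation(p):
    """Read left/right and front/behind off the signed coordinates."""
    side = "right of" if p[0] > 0 else "left of"
    depth = "in front of" if p[1] > 0 else "behind"
    return f"{side} and {depth}"

# Toy reconstructed scene (metres, camera frame): a chair facing -x, a cup nearby.
chair, chair_facing = [1.0, 3.0, 0.0], [-1.0, 0.0, 0.0]
cup = [0.5, 3.5, 0.8]

R, o = allocentric_frame(chair, chair_facing)
p = to_allocentric(cup, R, o)
print(f"The cup is {relation(p)} the chair.")  # right of and in front of
```

Once the relation is a deterministic sign check in the instantiated frame, no implicit mental rotation is required of the model.
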

๐Ÿ“ Abstract
With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
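
The abstract's final step, prompting the backbone VLM with a structured, geometry-grounded representation, amounts to serializing target-frame coordinates into text. The sketch below shows one plausible way to do that; the record format, field names, and instruction wording are illustrative assumptions, not the paper's actual schema.

```python
def build_allocentric_prompt(target, objects_in_target_frame, question):
    """Serialize target-frame coordinates into a structured prompt.
    Format and field names are hypothetical, not the paper's schema."""
    lines = [f"Reference frame: centered on the {target}; "
             "x=right, y=forward, z=up (metres)."]
    for name, (x, y, z) in objects_in_target_frame.items():
        lines.append(f"- {name}: x={x:+.2f}, y={y:+.2f}, z={z:+.2f}")
    lines.append(f"Question: {question}")
    lines.append("Answer from the coordinates above, not the camera view.")
    return "\n".join(lines)

prompt = build_allocentric_prompt(
    "chair",
    {"cup": (0.50, 0.50, 0.80), "lamp": (-1.20, 2.10, 0.00)},
    "Is the cup to the chair's left or right?",
)
```

Because the coordinates are already expressed in the instantiated frame, the VLM only has to read signs off the structured text rather than perform the perspective shift itself.
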
Problem

Research questions and friction points this paper is trying to address.

allocentric reasoning
egocentric visual priors
spatial reasoning
vision-language models
perspective shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Allocentric Perception
Frame Instantiation
3D Geometry Reconstruction
Vision-Language Models
Perspective Transformation