GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a weakness of multimodal large language models (MLLMs) in spatial reasoning. Existing methods inject geometric features taken statically from a single layer of a 3D foundation model, and no single layer can serve the varied spatial demands of different tasks. To overcome this, the authors propose GeoAlign, a framework that dynamically aggregates geometric features from multiple layers. GeoAlign constructs a hierarchical geometric feature bank and uses the MLLM's original visual tokens as content-aware queries for layer-wise sparse routing, adaptively selecting and fusing the most suitable geometric features for each image patch. This realigns the geometric features with the actual demands of the task rather than with the 3D pretraining objective. Experiments on VSI-Bench, ScanQA, and SQA3D show that GeoAlign achieves state-of-the-art performance, with its 4B-parameter variant outperforming larger existing MLLMs.

📝 Abstract

Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM's original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model effectively achieves state-of-the-art performance, even outperforming larger existing MLLMs.
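The routing idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the scoring function, the query projection, or how the selected layers are fused, so the dot-product scores, the `w_query` projection, the `top_k` parameter, and the softmax weighting over the selected layers are all assumptions made for illustration, as is the function name `route_geometric_features`.

```python
import numpy as np

def route_geometric_features(visual_tokens, feature_bank, w_query, top_k=2):
    """Illustrative per-patch sparse routing over a bank of layer features.

    visual_tokens: (P, Dv) MLLM visual tokens, one per image patch.
    feature_bank:  (L, P, Dg) geometric features from L layers of a 3D model.
    w_query:       (Dv, Dg) hypothetical projection from token to geometry space.
    Returns fused features (P, Dg) and routing weights (P, L).
    """
    # Content-aware queries derived from the visual tokens (assumed linear map).
    queries = visual_tokens @ w_query                                  # (P, Dg)

    # Assumed scoring: scaled dot product between each patch query and the
    # feature of the same patch at every layer.
    scale = np.sqrt(queries.shape[-1])
    scores = np.einsum("pd,lpd->pl", queries, feature_bank) / scale    # (P, L)

    # Sparse routing: keep only the top_k highest-scoring layers per patch.
    top_idx = np.argsort(scores, axis=1)[:, -top_k:]                   # (P, k)
    keep = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(keep, top_idx, True, axis=1)
    masked = np.where(keep, scores, -np.inf)

    # Softmax over the kept layers only; dropped layers get weight 0.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=1, keepdims=True)             # (P, L)

    # Fuse the selected layers into one geometric feature per patch.
    fused = np.einsum("pl,lpd->pd", weights, feature_bank)             # (P, Dg)
    return fused, weights
```

In this sketch each patch may choose a different subset of layers, which is the property that a static single-layer extraction lacks. How the fused features are then combined with the visual tokens is not described in the abstract and is left out here.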

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

multimodal large language models

geometric features

task misalignment

3D foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

GeoAlign

geometric feature realignment

multimodal large language models