🤖 AI Summary
This work addresses a key limitation of existing models: they struggle to integrate multi-view information for human-like spatial reasoning in multi-image tasks. To bridge this gap, we propose HATCH, a novel training framework that, for the first time, incorporates explicit patch-level spatial alignment across views together with an action-then-answer reasoning mechanism, jointly mimicking human multi-perspective spatial cognition. By co-optimizing these two complementary objectives, HATCH preserves strong single-image reasoning capabilities while significantly enhancing the spatial reasoning performance of multimodal large language models. Experimental results demonstrate that HATCH substantially outperforms same-scale baselines across three established benchmarks, achieving performance on par with considerably larger models.
📝 Abstract
While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integrating information across multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for either. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.
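The abstract does not specify the form of the Patch-Level Spatial Alignment objective. As a purely illustrative sketch, one common way to "encourage patch representations to align across views for spatially corresponding regions" is an InfoNCE-style contrastive loss, where patch embeddings covering the same physical location in two views are treated as positive pairs and all other patches as negatives. Everything below (function name, the assumption that corresponding patches share a row index, the temperature value) is a hypothetical reconstruction, not the paper's actual formulation:

```python
import numpy as np

def patch_alignment_loss(patches_a, patches_b, temperature=0.1):
    """Illustrative InfoNCE-style alignment loss.

    patches_a, patches_b: (N, D) patch embeddings from two views,
    where row i of each array is assumed to cover the same physical
    region (positive pair); all other rows act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = patches_a / np.linalg.norm(patches_a, axis=1, keepdims=True)
    b = patches_b / np.linalg.norm(patches_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature          # (N, N) similarity matrix
    # Cross-entropy with the diagonal (corresponding patches) as targets.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
# Well-aligned second view: same embeddings plus small noise.
view_b_good = view_a + 0.01 * rng.normal(size=(8, 16))
# Misaligned second view: unrelated embeddings.
view_b_bad = rng.normal(size=(8, 16))
print(patch_alignment_loss(view_a, view_b_good)
      < patch_alignment_loss(view_a, view_b_bad))
```

Under this (assumed) formulation, the loss is low when corresponding patches across views are each other's nearest neighbors in embedding space and high otherwise, which matches the cross-view correspondence mechanism the abstract describes.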