🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit limited performance on spatial reasoning and fine-grained visual perception, largely because they lack prompt-region awareness grounded in spatial cues and mechanisms for dynamically correcting attention. To address this, the authors propose SIFThinker, a spatially-aware "think-with-images" framework featuring: (1) an image-focusing mechanism that dynamically attends to prompt-relevant regions via depth-enhanced bounding boxes interleaved with natural language; (2) a reverse-expansion forward-inference strategy for generating interleaved image-text chains of thought, together with GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into the reasoning pipeline; and (3) SIF-50K, a process-supervision dataset constructed with this strategy. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods on spatial understanding and fine-grained visual perception benchmarks while maintaining strong general capabilities, validating the effectiveness of spatially informed attention correction.
📝 Abstract
Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning; however, they fail to leverage attention correction with spatial cues to iteratively refine focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. Specifically, SIFThinker enables attention correction and image-region focusing by interleaving depth-enhanced bounding boxes with natural language. Our contributions are twofold: First, we introduce a reverse-expansion forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Second, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method.