🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited spatial reasoning capabilities when processing pure visual inputs. Method: This paper introduces the first training-free, vision-only spatial prompting framework. It explicitly models spatial relationships and temporal coherence through semantic-rich keyframe sampling—leveraging off-the-shelf perception models—and visual trajectory simulation based on relative positional encoding. Unlike conventional approaches requiring fine-tuning or auxiliary modalities, our method enhances spatial reasoning solely via input-side visual structural augmentation. Contribution/Results: Evaluated on VSI-BENCH and STI-BENCH, the framework consistently improves spatial reasoning performance across multiple state-of-the-art MLLMs, with gains up to 3.5%. These results demonstrate its effectiveness, efficiency, and strong generalizability without any parameter updates or multimodal inputs.
📝 Abstract
We introduce SEE&TREK, the first training-free prompting framework tailored to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. While prior efforts have incorporated modalities like depth or point clouds to improve spatial reasoning, purely visual spatial understanding remains underexplored. SEE&TREK addresses this gap by focusing on two core principles: increasing visual diversity and motion reconstruction. For visual diversity, we conduct Maximum Semantic Richness Sampling, which employs an off-the-shelf perception model to extract semantically rich keyframes that capture scene structure. For motion reconstruction, we simulate visual trajectories and encode relative spatial positions into keyframes to preserve both spatial relations and temporal coherence. Our method is training- and GPU-free, requiring only a single forward pass, and can be seamlessly integrated into existing MLLMs. Extensive experiments on VSI-Bench and STI-Bench show that SEE&TREK consistently boosts the performance of various MLLMs across diverse spatial reasoning tasks, with improvements of up to +3.5%, offering a promising path toward stronger spatial intelligence.
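To make the keyframe-selection idea concrete, here is a minimal illustrative sketch of semantic-richness-driven sampling. It is an assumption-laden toy, not the paper's actual algorithm: it takes per-frame semantic label sets (standing in for the output of an off-the-shelf perception model) and greedily picks the frames that add the most new semantic content.

```python
# Toy sketch of semantic-richness keyframe sampling (an assumption,
# not SEE&TREK's exact method). Each frame is represented by the set
# of semantic labels a perception model detected in it; we greedily
# select the K frames that maximize newly covered labels.

def sample_keyframes(frame_labels, k):
    """frame_labels: list of label sets, one per video frame.
    Returns indices of up to k keyframes, in temporal order."""
    selected, covered = [], set()
    for _ in range(min(k, len(frame_labels))):
        # Pick the unselected frame contributing the most uncovered labels.
        best = max(
            (i for i in range(len(frame_labels)) if i not in selected),
            key=lambda i: len(frame_labels[i] - covered),
        )
        selected.append(best)
        covered |= frame_labels[best]
    return sorted(selected)

frames = [{"chair"}, {"chair", "table"}, {"sofa", "lamp", "rug"}, {"table"}]
print(sample_keyframes(frames, 2))  # -> [1, 2]: together they cover 5 labels
```

The greedy coverage criterion here is one plausible instantiation of "maximum semantic richness"; the paper's pipeline additionally encodes relative spatial positions into the selected keyframes to simulate a visual trajectory.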