🤖 AI Summary
This work investigates how well multimodal large language models (MLLMs) can identify the rotation angle of an input image (0°, 90°, 180°, or 270°). It introduces RotBench, a benchmark of 350 manually filtered lifestyle, portrait, and landscape images. Evaluation of state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, shows that most models reliably identify upright (0°) images and certain models identify upside-down (180°) images, but none reliably distinguishes 90° from 270°. Auxiliary inputs such as captions and depth maps, as well as chain-of-thought prompting, yield only small and inconsistent improvements. Showing the image in several orientations at once gives moderate gains for reasoning models, and a modified setup that uses voting improves the performance of weaker models. Fine-tuning substantially improves identification of 180° images but does not improve the ability to distinguish 90° from 270°. Together, these results reveal a significant gap between MLLMs' spatial reasoning and human perception of rotation.
📝 Abstract
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
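The abstract mentions a voting setup but does not describe how it works. The sketch below is one plausible reading, not the authors' implementation: the input is shown to the model in four orientations, each answer is corrected for the extra rotation that was applied, and the most common corrected answer wins. The names `rotate_ccw`, `vote_rotation`, and `predict` are hypothetical, the image is stood in for by a nested list of pixel values, and the counter-clockwise angle convention is an assumption.

```python
from collections import Counter

ANGLES = (0, 90, 180, 270)


def rotate_ccw(grid, degrees):
    """Rotate a 2D list counter-clockwise by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("degrees must be a multiple of 90")
    out = [list(row) for row in grid]
    for _ in range((degrees // 90) % 4):
        # Transpose, then reverse the row order: one 90-degree CCW turn.
        out = [list(row) for row in zip(*out)][::-1]
    return out


def vote_rotation(image, predict):
    """Estimate the rotation of `image` by majority vote over four views.

    `predict` is a hypothetical callable standing in for a model query;
    it takes an image and returns its guess (0, 90, 180, or 270) for how
    far that image is rotated from upright.
    """
    votes = Counter()
    for extra in ANGLES:
        view = rotate_ccw(image, extra)
        guess = predict(view)
        # Undo the extra rotation so every vote refers to the original input.
        votes[(guess - extra) % 360] += 1
    # Ties resolve to the estimate that was seen first.
    return votes.most_common(1)[0][0]
```

Under this reading, a model that answers 90° for both 90° and 270° views is wrong in only one of the four views, so the vote can still recover the correct angle. Whether that matches the gains reported for weaker models depends on details the abstract does not give.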