RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the spatial-reasoning limitations of multimodal large language models (MLLMs) in identifying image rotation (0°/90°/180°/270°). To this end, the authors introduce RotBench, a benchmark of 350 manually filtered lifestyle, portrait, and landscape images. Systematic evaluation of state-of-the-art open and proprietary MLLMs shows that while models reliably recognize 0° images and some recognize 180° rotations, none reliably distinguishes 90° from 270°; auxiliary inputs such as captions and depth maps, chain-of-thought prompting, and fine-tuning fail to close this gap, exposing fundamental limits in their spatial representations. A model-agnostic multi-view setup that shows the image at several orientations yields moderate gains for reasoning models, and a voting variant improves weaker models without architectural changes. Together, these findings establish a scalable benchmark for assessing MLLMs' spatial understanding and reveal a persistent gap between model and human perception of rotation.

📝 Abstract
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
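The multi-view voting setup described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ask_model` stands in for a real MLLM query, and the toy model below is a hypothetical stand-in that mimics the reported failure mode (reliable on 0°/180°, biased on 90°/270°). Each prediction made on a rotated view is normalized back into the original image's frame before the majority vote.

```python
from collections import Counter

ROTATIONS = [0, 90, 180, 270]

def normalize(pred_at_view, view):
    """Map a prediction made on a rotated view back to the original frame.

    If the image's true rotation is r and the view adds `view` degrees,
    a correct model predicts (r + view) mod 360, so subtracting `view`
    recovers r.
    """
    return (pred_at_view - view) % 360

def vote_rotation(ask_model, image):
    """Query the model on all four rotated views and take a majority vote."""
    votes = [normalize(ask_model(image, view), view) for view in ROTATIONS]
    return Counter(votes).most_common(1)[0][0]

# Toy stand-in model (hypothetical): `image` encodes its true rotation.
# It is correct on 0° and 180° views but always answers 270° otherwise,
# mimicking the 90°/270° confusion the benchmark reveals.
def toy_model(image, view):
    seen = (image + view) % 360
    return seen if seen in (0, 180) else 270
```

With this biased toy model, single-view queries misidentify 90° images, but the four-view vote recovers the correct answer (e.g. `vote_rotation(toy_model, 90)` returns 90), illustrating why voting helps weaker models.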
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability to identify image rotation angles
Assessing spatial reasoning for rotated image orientation detection
Benchmarking model performance on distinguishing 90° and 270° rotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

RotBench: a 350-image manually filtered benchmark for MLLM rotation identification
Auxiliary information (captions, depth maps) yields only small, inconsistent gains
Fine-tuning improves 180° identification but not 90°/270° discrimination