Why MLLMs Struggle to Determine Object Orientations

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study challenges the prevailing assumption that multimodal large language models (MLLMs) struggle to estimate 2D object orientation in images because their visual encoders lack geometric reasoning ability. In controlled experiments, the authors train simple linear regression models to predict rotation angles directly from visual features produced by the SigLIP, ViT, and CLIP encoders used in LLaVA-OneVision, Qwen2.5-VL, and LLaVA 1.5/1.6. The results show that orientation can be recovered accurately with linear probes alone, so the visual encoders do preserve this information, although it is spread diffusely across tens of thousands of features. The bottleneck in orientation understanding is therefore unlikely to be the visual representation itself. The paper leaves the actual cause open and notes that the diffuse encoding may or may not be what prevents MLLMs from exploiting the information.

📝 Abstract

Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder representations. In particular, we examine SigLIP and ViT features from the LLaVA-OneVision and Qwen2.5-VL-7B-Instruct models, respectively, using full images, and examine CLIP representations in LLaVA 1.5 and 1.6 using rotated foreground patches against natural background images. Our null hypothesis is that orientation information is not preserved in the encoder embeddings, and we test this by training linear regressors to predict object orientation from encoded features. Contrary to the hypothesis, we find that orientation information is recoverable from encoder representations: simple linear models accurately predict object orientations from embeddings. This contradicts the assumption that MLLM orientation failures originate in the visual encoder. Having rejected the accepted hypothesis that MLLMs struggle with 2D orientation tasks because of visual encoder limitations, we still don't know why they fail. A full explanation is beyond the scope of this paper, but we show that orientation information, while present, is spread diffusely across tens of thousands of features. This may or may not be why MLLMs fail to exploit the available orientation information.
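
The linear-probe idea in the abstract can be illustrated with a small sketch. It is not the authors' code or protocol. The features are synthetic stand-ins generated by a hypothetical `synthetic_embeddings` helper, not real SigLIP, ViT, or CLIP embeddings. Encoding the angle as a (sin, cos) pair and fitting with closed-form ridge regression are also assumptions made here for illustration, and all function names are hypothetical.

```python
import numpy as np


def angle_targets(angles_deg):
    """Encode angles as (sin, cos) so 0 and 360 degrees share one target (an assumption)."""
    theta = np.deg2rad(angles_deg)
    return np.stack([np.sin(theta), np.cos(theta)], axis=1)


def synthetic_embeddings(angles_deg, n_features=256, noise=0.3, seed=0):
    """Hypothetical stand-in for encoder features (NOT real encoder output).

    Every feature carries only a weak, noisy trace of the angle, so the
    orientation signal is spread across all features rather than one.
    """
    rng = np.random.default_rng(seed)
    mixing = rng.normal(scale=0.2, size=(2, n_features))
    clean = angle_targets(angles_deg) @ mixing
    return clean + rng.normal(scale=noise, size=clean.shape)


def fit_linear_probe(features, angles_deg, ridge=1.0):
    """Closed-form ridge regression from features (plus a bias column) to (sin, cos).

    The bias weight is regularized too, purely to keep the sketch short.
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ angle_targets(angles_deg))


def predict_angles(weights, features):
    """Map probe outputs back to angles in [0, 360) degrees."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    sin_cos = X @ weights
    return np.rad2deg(np.arctan2(sin_cos[:, 0], sin_cos[:, 1])) % 360.0


def mean_angular_error(pred_deg, true_deg):
    """Mean absolute error in degrees, taking the shorter way around the circle."""
    diff = np.abs(np.asarray(pred_deg) - np.asarray(true_deg)) % 360.0
    return float(np.mean(np.minimum(diff, 360.0 - diff)))


def run_demo(n_samples=3000, n_train=2400, seed=0):
    """Fit the probe on a training split and score it on held-out samples."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 360.0, size=n_samples)
    feats = synthetic_embeddings(angles, seed=seed + 1)
    weights = fit_linear_probe(feats[:n_train], angles[:n_train])
    pred = predict_angles(weights, feats[n_train:])
    return mean_angular_error(pred, angles[n_train:])


if __name__ == "__main__":
    print(f"held-out mean angular error: {run_demo():.1f} degrees")
```

For uniformly distributed angles, a predictor with no orientation information would average about 90 degrees of angular error, so a held-out error far below that indicates the information is linearly recoverable. To probe a real encoder, the synthetic features would be replaced by embeddings extracted from rotated images.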

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

object orientation

visual encoder

geometric reasoning

2D orientation

Innovation

Methods, ideas, or system contributions that make the work stand out.

object orientation

visual encoder

multimodal large language models