Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary

πŸ“… 2026-03-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current vision-language benchmarks conflate object orientation with spatial concepts such as position and scene context, hindering accurate evaluation of multimodal large language models’ orientation reasoning. This work proposes DORI, a cognitively inspired hierarchical benchmark that decomposes orientation into four dimensions grounded in human cognitive development. By leveraging bounding-box isolation, standardized spatial reference frames, and structured multiple-choice questions, DORI builds a large-scale evaluation set of 33,656 questions spanning both coarse-grained (categorical) and fine-grained (metric) levels. Experiments across 24 state-of-the-art models show that current systems perform near chance on object-centric orientation tasks (best accuracy: 54.2% coarse, 45.0% granular), with pronounced failures on compound rotations and reference-frame transformations, exposing a reliance on category-based heuristics rather than genuine geometric reasoning.

πŸ“ Abstract
Humans learn object orientation progressively, from recognizing which way an object faces, to mentally rotating it, to reasoning about orientations between objects. Current vision-language benchmarks largely conflate orientation with position and general scene understanding. We introduce Discriminative Orientation Reasoning Intelligence (DORI), a cognitively grounded hierarchical benchmark that makes object orientation the primary target. Inspired by stages of human orientation cognition, DORI decomposes orientation into four dimensions, each evaluated at coarse (categorical) and granular (metric) levels. Built from 13,652 images across 14 sources, DORI provides 33,656 multiple-choice questions covering 67 object categories in real-world and synthetic settings. Its coarse-to-granular design isolates orientation from confounds such as object recognition difficulty, scene clutter, and linguistic ambiguity via bounding-box isolation, standardized spatial reference frames, and structured prompts. Evaluating 24 state-of-the-art vision-language models shows a clear pattern: models that perform well on general spatial benchmarks are near-random on object-centric orientation tasks. The best models reach only 54.2% on coarse and 45.0% on granular judgments, with the largest failures on compound rotations and shifts in inter-object reference frames. Large coarse-to-granular gaps reveal reliance on categorical heuristics rather than geometric reasoning, a limitation hidden by existing benchmarks. These results identify orientation understanding as an unsolved challenge for multimodal systems, with implications for robotic manipulation, 3D scene reconstruction, and human-AI interaction.
Problem

Research questions and friction points this paper is trying to address.

object orientation
multimodal large language models
spatial reasoning
visual benchmarks
cognitive grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

orientation reasoning
multimodal large language models
cognitive benchmark
spatial understanding
geometric reasoning
πŸ”Ž Similar Papers
No similar papers found.
Nazia Tasnim
Boston University
PEFT & Model Editing, Computer Vision, Explainable AI, Multimodal Systems
Keanu Nichols
Boston University
Yuting Yang
Johns Hopkins University
Nicholas Ikechukwu
Boston University
Elva Zou
Boston University
Deepti Ghadiyaram
Asst. Professor, Boston University
Computer Vision, Machine learning, XAI, Interpretability
Bryan A. Plummer
Boston University