🤖 AI Summary
This work exposes a critical limitation of multimodal large language models (MLLMs, e.g., GPT-4.1): when reading the time on analog clocks, they rely heavily on superficial visual patterns rather than genuine geometric and spatial reasoning. To rigorously assess generalization, the authors introduce the first diverse, structurally controlled clock benchmark, systematically varying hand proportions, dial styles, lighting conditions, and compositional layouts, and design a zero-shot evaluation protocol alongside synthetic data augmentation and lightweight fine-tuning experiments. The results show that fine-tuning improves only in-distribution accuracy; performance collapses on unseen dial structures and geometric configurations. Crucially, the models fail to abstract the rigid angular relationship between the hour and minute hands. The study provides the first systematic empirical evidence of fundamental deficits in basic spatiotemporal geometric reasoning in current MLLMs, establishing a benchmark and methodological framework for evaluating multimodal representation learning and embodied spatial reasoning.
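The rigid coupling the summary refers to is easy to state: the minute hand sweeps 6° per minute, and the hour hand advances 30° per hour plus 0.5° per elapsed minute, so the hour hand's offset within its 30° sector is always exactly one twelfth of the minute hand's angle. The following minimal Python sketch (illustrative only, not the paper's code) encodes this constraint in both directions:

```python
from datetime import time

def hand_angles(t: time) -> tuple[float, float]:
    """Clockwise angles, in degrees from 12 o'clock, of the hour and minute hands."""
    minute_angle = 6.0 * (t.minute + t.second / 60.0)        # 360° / 60 min
    hour_angle = 30.0 * (t.hour % 12) + minute_angle / 12.0  # 360° / 12 h, plus drift
    return hour_angle, minute_angle

def read_time(hour_angle: float, minute_angle: float) -> tuple[int, int]:
    """Invert the geometry: recover (hour, minute) from the two hand angles.

    The hands are rigidly coupled: the hour hand's position within its 30°
    sector must equal minute_angle / 12. A reader that has abstracted this
    constraint can resolve ambiguous dials; a pattern matcher cannot.
    """
    minute = round(minute_angle / 6.0) % 60
    hour = int(hour_angle // 30) % 12
    return hour, minute

if __name__ == "__main__":
    h_deg, m_deg = hand_angles(time(4, 50))
    print(h_deg, m_deg)          # 145.0 300.0
    print(read_time(h_deg, m_deg))  # (4, 50)
```

The 4:50 example shows why the coupling matters: the hour hand at 145° sits visually closer to the 5 than the 4, the classic configuration where superficial pattern matching misreads the hour.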
📝 Abstract
Multimodal Large Language Models (MLLMs), which can answer complex questions about an image, still struggle to tell the time on analog clocks. This is probably due to the scarcity of images of clocks showing different times in their training data. In this work we explore this issue with one of the latest MLLMs, GPT-4.1, to understand why MLLMs fail to tell the time and whether fine-tuning can solve the problem. The results show that models are making progress in reading the time on analog clocks. But have they really learned to do it, or have they only learned patterns in their training datasets? We put the models to the test with different clocks to illustrate the limitations of MLLMs in abstracting and generalizing.
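The abstract does not detail how the test clocks were produced; as a hedged sketch of how synthetic dials at arbitrary times can be rendered for such an evaluation (the function and file names are illustrative, not from the paper), using matplotlib:

```python
import random
import matplotlib.pyplot as plt
import numpy as np

def draw_clock(hour: int, minute: int, path: str) -> None:
    """Render a minimal analog clock face showing hour:minute and save it to path."""
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, lw=2))
    # Hour ticks every 30°, measured clockwise from 12 o'clock.
    for k in range(12):
        a = np.deg2rad(90 - 30 * k)
        ax.plot([0.9 * np.cos(a), np.cos(a)], [0.9 * np.sin(a), np.sin(a)], "k-")
    minute_deg = 6 * minute
    hour_deg = 30 * (hour % 12) + minute / 2  # hour hand drifts 0.5° per minute
    # Short thick hour hand, long thin minute hand.
    for deg, length, lw in [(hour_deg, 0.5, 4), (minute_deg, 0.8, 2)]:
        a = np.deg2rad(90 - deg)
        ax.plot([0, length * np.cos(a)], [0, length * np.sin(a)], "k-", lw=lw)
    ax.set_xlim(-1.1, 1.1)
    ax.set_ylim(-1.1, 1.1)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=100)
    plt.close(fig)

if __name__ == "__main__":
    h, m = random.randrange(12), random.randrange(60)
    draw_clock(h, m, f"clock_{h:02d}{m:02d}.png")
```

Varying hand lengths, tick styles, or dial decoration in such a generator is one way to probe whether a model reads the geometry or merely matches familiar clock faces.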