Have Multimodal Large Language Models (MLLMs) Really Learned to Tell the Time on Analog Clocks?

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a critical limitation of multimodal large language models (MLLMs, e.g., GPT-4.1): their heavy reliance on superficial visual patterns—rather than genuine geometric and spatial reasoning—for analog clock time recognition. To rigorously assess generalization, we introduce the first diverse, structurally controlled clock benchmark, systematically varying hand proportions, dial styles, lighting conditions, and compositional layouts. We design a zero-shot evaluation protocol alongside synthetic data augmentation and lightweight fine-tuning experiments. Results show that fine-tuning improves only in-distribution accuracy; performance collapses under unseen dial structures or geometric configurations. Crucially, MLLMs fail to abstract the rigid angular relationships between hour and minute hands. This study provides the first systematic empirical evidence of fundamental deficits in basic spatiotemporal geometric reasoning in current MLLMs, establishing a new benchmark and methodological framework for evaluating multimodal representation learning and embodied spatial reasoning capabilities.
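The "rigid angular relationships" the summary refers to can be made concrete: on a standard dial the minute hand advances 6° per minute, while the hour hand advances 30° per hour plus 0.5° per elapsed minute. A minimal sketch of this geometry and its inverse (illustrative only, not code from the paper):

```python
def hand_angles(hour, minute):
    """Return (hour_hand, minute_hand) angles in degrees, clockwise from 12."""
    minute_angle = 6.0 * minute                      # 360 deg / 60 minutes
    hour_angle = 30.0 * (hour % 12) + 0.5 * minute   # 360 deg / 12 hours, plus drift
    return hour_angle, minute_angle

def angles_to_time(hour_angle, minute_angle):
    """Invert the mapping: recover (hour, minute) from the two hand angles."""
    minute = round(minute_angle / 6.0) % 60
    hour = round((hour_angle - 0.5 * minute) / 30.0) % 12
    return hour, minute
```

Reading a clock amounts to inverting this fixed mapping, e.g. `angles_to_time(195.0, 180.0)` gives `(6, 30)`; the paper's finding is that MLLMs do not abstract this invariant and instead match surface patterns.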

📝 Abstract
Multimodal Large Language Models, which can answer complex questions about an image, struggle to tell the time on analog clocks. This is probably due to the lack of images of clocks showing different times in their training sets. In this work we explore this issue with one of the latest MLLMs, GPT-4.1, to understand why MLLMs fail to tell the time and whether fine-tuning can solve the problem. The results show that models are making progress in reading the time on analog clocks. But have they really learned to do it, or have they only learned patterns in their training datasets? In this work we put the models to the test with different clocks to illustrate the limitations of MLLMs in abstracting and generalizing.
Problem

Research questions and friction points this paper is trying to address.

MLLMs struggle to tell time on analog clocks
Lack of diverse clock images in training data
Testing if fine-tuning improves time-telling generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning GPT-4.1 for analog clock time-telling
Testing MLLMs with varied clock images
Analyzing training data patterns in MLLMs
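Testing with varied clock images presupposes a way to render dials whose hand lengths and styles differ; a hypothetical helper (not from the paper) that maps a hand angle to dial coordinates, from which hand proportions can be varied when generating synthetic clocks:

```python
import math

def hand_endpoint(angle_deg, length, center=(0.5, 0.5)):
    """Map a clockwise-from-12 hand angle to an (x, y) endpoint on a unit dial.

    Varying `length` (e.g. 0.25 for the hour hand, 0.4 for the minute hand)
    produces the different hand proportions used to probe generalization.
    """
    theta = math.radians(angle_deg)
    cx, cy = center
    # y decreases upward in image coordinates, so 12 o'clock is (cx, cy - length)
    return (cx + length * math.sin(theta), cy - length * math.cos(theta))
```

For example, a hand at 90° (pointing to 3) with length 0.4 ends at roughly (0.9, 0.5); drawing a line from the center to that point renders the hand.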
Tairan Fu
College of Mechanical and Electrical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China
Miguel González
ETSI de Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain
Javier Conde
ETSI de Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain
Elena Merino-Gómez
Universidad de Valladolid
Pedro Reviriego
ETSI de Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain