🤖 AI Summary
This work exposes a critical limitation of multimodal large language models (MLLMs, e.g., GPT-4.1): when reading the time on analog clocks, they rely heavily on superficial visual patterns rather than genuine geometric and spatial reasoning. To rigorously assess generalization, the authors introduce the first diverse, structurally controlled clock benchmark, systematically varying hand proportions, dial styles, lighting conditions, and compositional layouts, and design a zero-shot evaluation protocol alongside synthetic data augmentation and lightweight fine-tuning experiments. The results show that fine-tuning improves only in-distribution accuracy; performance collapses on unseen dial structures and geometric configurations. Crucially, the models fail to abstract the rigid angular relationship between the hour and minute hands. The study provides the first systematic empirical evidence of fundamental deficits in basic spatiotemporal geometric reasoning in current MLLMs, establishing a benchmark and methodological framework for evaluating multimodal representation learning and embodied spatial reasoning.
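The rigid coupling the summary refers to is easy to state: the minute hand sweeps 6° per minute, and the hour hand advances 30° per hour plus 0.5° per elapsed minute, so the hour hand's offset within its 30° sector is always exactly one twelfth of the minute hand's angle. The following minimal Python sketch (illustrative only, not the paper's code) encodes this constraint in both directions:

```python
from datetime import time

def hand_angles(t: time) -> tuple[float, float]:
    """Clockwise angles, in degrees from 12 o'clock, of the hour and minute hands."""
    minute_angle = 6.0 * (t.minute + t.second / 60.0)        # 360° / 60 min
    hour_angle = 30.0 * (t.hour % 12) + minute_angle / 12.0  # 360° / 12 h, plus drift
    return hour_angle, minute_angle

def read_time(hour_angle: float, minute_angle: float) -> tuple[int, int]:
    """Invert the geometry: recover (hour, minute) from the two hand angles.

    The hands are rigidly coupled: the hour hand's position within its 30°
    sector must equal minute_angle / 12. A reader that has abstracted this
    constraint can resolve ambiguous dials; a pattern matcher cannot.
    """
    minute = round(minute_angle / 6.0) % 60
    hour = int(hour_angle // 30) % 12
    return hour, minute

if __name__ == "__main__":
    h_deg, m_deg = hand_angles(time(4, 50))
    print(h_deg, m_deg)          # 145.0 300.0
    print(read_time(h_deg, m_deg))  # (4, 50)
```

The 4:50 example shows why the coupling matters: the hour hand at 145° sits visually closer to the 5 than the 4, the classic configuration where superficial pattern matching misreads the hour.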
📝 Abstract
Multimodal Large Language Models (MLLMs), which can answer complex questions about an image, still struggle to tell the time on analog clocks. This is probably due to the scarcity of images of clocks showing different times in their training data. In this work we explore this issue with one of the latest MLLMs, GPT-4.1, to understand why MLLMs fail to tell the time and whether fine-tuning can solve the problem. The results show that models are making progress in reading the time on analog clocks. But have they really learned to do it, or have they only learned patterns in their training datasets? We put the models to the test with different clocks to illustrate the limitations of MLLMs in abstracting and generalizing.
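The abstract does not detail how the test clocks were produced; as a hedged sketch of how synthetic dials at arbitrary times can be rendered for such an evaluation (the function and file names are illustrative, not from the paper), using matplotlib:

```python
import random
import matplotlib.pyplot as plt
import numpy as np

def draw_clock(hour: int, minute: int, path: str) -> None:
    """Render a minimal analog clock face showing hour:minute and save it to path."""
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, lw=2))
    # Hour ticks every 30°, measured clockwise from 12 o'clock.
    for k in range(12):
        a = np.deg2rad(90 - 30 * k)
        ax.plot([0.9 * np.cos(a), np.cos(a)], [0.9 * np.sin(a), np.sin(a)], "k-")
    minute_deg = 6 * minute
    hour_deg = 30 * (hour % 12) + minute / 2  # hour hand drifts 0.5° per minute
    # Short thick hour hand, long thin minute hand.
    for deg, length, lw in [(hour_deg, 0.5, 4), (minute_deg, 0.8, 2)]:
        a = np.deg2rad(90 - deg)
        ax.plot([0, length * np.cos(a)], [0, length * np.sin(a)], "k-", lw=lw)
    ax.set_xlim(-1.1, 1.1)
    ax.set_ylim(-1.1, 1.1)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=100)
    plt.close(fig)

if __name__ == "__main__":
    h, m = random.randrange(12), random.randrange(60)
    draw_clock(h, m, f"clock_{h:02d}{m:02d}.png")
```

Varying hand lengths, tick styles, or dial decoration in such a generator is one way to probe whether a model reads the geometry or merely matches familiar clock faces.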