TempViz: On the Evaluation of Temporal Knowledge in Text-to-Image Models

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of accurately modeling and evaluating temporally grounded visual changes in text-to-image generation. To this end, the authors introduce TempViz, the first benchmark dataset specifically designed to assess temporal knowledge in such models, comprising 7.9k textual prompts and over 600 reference images spanning five temporal dimensions: seasons, decades, life cycles, diurnal cycles, and historical periods. Combining human evaluation with multiple automated metrics, the study systematically evaluates five state-of-the-art models and finds that none achieves above 75% accuracy in temporal reasoning. Furthermore, existing automatic evaluation methods prove unreliable at capturing temporal cues. This research presents the first comprehensive framework for evaluating the temporal reasoning capabilities of text-to-image models, highlighting significant limitations in current approaches and underscoring the urgency of advancing this underexplored direction.

📝 Abstract
Time alters the visual appearance of entities in our world, like objects, places, and animals. Thus, for accurately generating contextually relevant images, knowledge and reasoning about time can be crucial (e.g., for generating a landscape in spring vs. in winter). Yet, although substantial work exists on understanding and improving temporal knowledge in natural language processing, research on how temporal phenomena appear and are handled in text-to-image (T2I) models remains scarce. We address this gap with TempViz, the first dataset to holistically evaluate temporal knowledge in image generation, consisting of 7.9k prompts and more than 600 reference images. Using TempViz, we study the capabilities of five T2I models across five temporal knowledge categories. Human evaluation shows that temporal competence is generally weak, with no model exceeding 75% accuracy across categories. Towards larger-scale studies, we also examine automated evaluation methods, comparing several established approaches against human judgments. However, none of these approaches provides a reliable assessment of temporal cues, further indicating the pressing need for future research on temporal knowledge in T2I.
Problem

Research questions and friction points this paper is trying to address.

temporal knowledge
text-to-image models
image generation
time-awareness
visual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal knowledge
text-to-image generation
evaluation dataset
human evaluation
automated assessment