Chain of Time: In-Context Physical Simulation with Image Generation Models

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited physical simulation capability and poor interpretability of vision-language models (VLMs). We propose a cognitively inspired "Chain of Time" reasoning method that operates at inference time without fine-tuning. It explicitly models sequential physical phenomena—including velocity, gravity, collisions, fluid dynamics, and momentum conservation—by generating intermediate image frames along the temporal trajectory. Our core innovation lies in a chain-based image generation mechanism, inspired by human mental simulation, integrated with in-context learning and multi-step image synthesis. The method simultaneously improves physical prediction accuracy and process interpretability across both 2D synthetic graphics and real-world 3D video scenes. Experiments demonstrate substantial improvements on zero-shot physical reasoning benchmarks. Moreover, our analysis of intermediate states is the first to empirically uncover the implicit dynamic reasoning capabilities—and their limitations—within generative models, thereby expanding the evaluation dimensions of VLMs' physical understanding.

📝 Abstract
We propose a novel cognitively-inspired method to improve and interpret physical simulation in vision-language models. Our "Chain of Time" method involves generating a series of intermediate images during a simulation, and it is motivated by in-context reasoning in machine learning, as well as mental simulation in humans. Chain of Time is used at inference time, and requires no additional fine-tuning. We apply the Chain-of-Time method to synthetic and real-world domains, including 2-D graphics simulations and natural 3-D videos. These domains test a variety of particular physical properties, including velocity, acceleration, fluid dynamics, and conservation of momentum. We found that using Chain-of-Time simulation substantially improves the performance of a state-of-the-art image generation model. Beyond examining performance, we also analyzed the specific states of the world simulated by an image model at each time step, which sheds light on the dynamics underlying these simulations. This analysis reveals insights that are hidden from traditional evaluations of physical reasoning, including cases where an image generation model is able to simulate physical properties that unfold over time, such as velocity, gravity, and collisions. Our analysis also highlights particular cases where the image generation model struggles to infer particular physical parameters from input images, despite being capable of simulating relevant physical processes.
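The abstract describes the method only at a high level: rather than asking an image model to jump directly to a final state, the simulation is rolled forward one intermediate frame at a time, with earlier frames supplied as in-context conditioning. A minimal sketch of that loop is below; `generate_frame` is a hypothetical stand-in for a real image-generation API call, not the authors' implementation, and the prompt wording is illustrative only.

```python
def generate_frame(prompt, context_frames):
    # Placeholder for an image-generation model call that would receive the
    # prior frames as in-context conditioning. Here it just labels the frame
    # by its position in the chain, so the chaining logic can be exercised.
    return f"frame_{len(context_frames)}"

def chain_of_time(initial_frame, instruction, num_steps):
    """Roll a physical scene forward one time step at a time.

    initial_frame: the observed starting image.
    instruction:   text describing the scenario (e.g. "a ball falls under gravity").
    num_steps:     number of intermediate frames to simulate.
    """
    frames = [initial_frame]
    for t in range(num_steps):
        prompt = (f"{instruction} Frames 0..{t} are shown. "
                  f"Generate the scene at time step {t + 1}.")
        frames.append(generate_frame(prompt, frames))
    return frames  # full trajectory: initial frame plus each simulated step

trajectory = chain_of_time("frame_0", "A ball falls under gravity.", 4)
print(trajectory)
```

The returned list of per-step frames is what makes the intermediate states inspectable, which is the basis for the interpretability analysis the abstract mentions.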
Problem

Research questions and friction points this paper is trying to address.

Improving physical simulation accuracy in vision-language models
Analyzing intermediate simulation steps for interpretable reasoning
Testing physical properties like velocity and fluid dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates intermediate images for physical simulation
Uses in-context reasoning without fine-tuning
Analyzes temporal physical properties like velocity and collisions
YingQiao Wang
Department of Psychology, Harvard University
Eric Bigelow
Department of Psychology, Harvard University
Boyi Li
Department of Computer Science, UC Berkeley
Tomer Ullman
Assistant Professor, Harvard
Cognitive Science · Computational Modeling · Cognitive Development · Artificial Intelligence