Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers

๐Ÿ“… 2025-07-22
๐Ÿ›๏ธ IEEE Transactions on Neural Networks and Learning Systems
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the insufficient causal modeling of existing video generation models for long-term prediction of dynamic physical systems. The authors propose a pure Transformer architecture that performs end-to-end spatiotemporal reasoning directly in continuous pixel space, eliminating reliance on latent feature learning or elaborate training strategies. The method integrates spatiotemporal self-attention, autoregressive modeling, and a probing mechanism, enabling unsupervised training on physics simulation datasets while supporting physical object tracking and PDE parameter estimation. Experiments demonstrate that, while maintaining competitive scores on mainstream video quality metrics (e.g., PSNR, SSIM), the approach extends the physically accurate prediction horizon by up to 50% relative to latent-space baselines. Moreover, the model exhibits strong interpretability and out-of-distribution generalization, successfully transferring to unseen PDE parameter estimation tasks.
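To make the pixel-space formulation concrete, the sketch below shows one plausible way to tokenize frames into non-overlapping pixel patches and predict the next frame autoregressively with a block-causal transformer. This is a minimal PyTorch illustration under assumed shapes and hyperparameters, not the authors' implementation; every class and parameter name here is hypothetical.

```python
import torch
import torch.nn as nn

class PixelSpaceVideoTransformer(nn.Module):
    """Hypothetical sketch: a causal transformer that predicts the next frame
    directly in continuous pixel space (no VAE or latent encoder)."""

    def __init__(self, frame_size=64, patch=8, dim=256, depth=6, heads=8):
        super().__init__()
        self.patch = patch
        n_patches = (frame_size // patch) ** 2         # patch tokens per frame
        patch_dim = patch * patch                      # grayscale pixels per patch
        self.to_tokens = nn.Linear(patch_dim, dim)     # linear patch embedding
        self.pos = nn.Parameter(torch.zeros(1, 1, n_patches, dim))  # spatial pos.
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.to_pixels = nn.Linear(dim, patch_dim)     # regress pixels, not logits

    def forward(self, frames):
        # frames: (B, T, H, W) in [0, 1]; tokens are non-overlapping patches
        B, T, H, W = frames.shape
        p = self.patch
        x = frames.unfold(2, p, p).unfold(3, p, p)        # (B, T, H/p, W/p, p, p)
        x = x.reshape(B, T, -1, p * p)                    # (B, T, N, p*p)
        x = self.to_tokens(x) + self.pos                  # embed + spatial position
        N = x.shape[2]
        x = x.reshape(B, T * N, -1)                       # flatten space-time
        # Block-causal mask: each frame attends to itself and earlier frames only
        t_idx = torch.arange(T * N, device=frames.device) // N
        mask = t_idx[None, :] > t_idx[:, None]            # True = attention blocked
        h = self.encoder(x, mask=mask)
        return self.to_pixels(h).reshape(B, T, N, p * p)  # per-patch pixel values
```

Training would minimize MSE between the prediction at frame t and the ground-truth patches of frame t+1, and rollout simply repeats the forward pass on the model's own outputs.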

๐Ÿ“ Abstract
Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction that operates on continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions, by up to 50% compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful for accurate estimation of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter-efficient, and interpretable approach.
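The abstract compares spatiotemporal self-attention layouts without specifying them. A common factorization in the video-transformer literature is spatial attention within each frame followed by causal temporal attention across frames; the block below is a minimal sketch of that layout. All names and dimensions are illustrative assumptions, and the layout the paper ultimately selects may differ.

```python
import torch
import torch.nn as nn

class FactoredSpaceTimeBlock(nn.Module):
    """Hypothetical sketch of one 'divided' spatiotemporal layout:
    spatial self-attention within each frame, then causal temporal
    self-attention across frames at each spatial location."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, T, N, D) -- T frames, N patch tokens per frame
        B, T, N, D = x.shape
        s = self.norm1(x).reshape(B * T, N, D)          # attend within each frame
        s, _ = self.spatial(s, s, s)
        x = x + s.reshape(B, T, N, D)
        t = self.norm2(x).transpose(1, 2).reshape(B * N, T, D)  # attend across time
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        t, _ = self.temporal(t, t, t, attn_mask=causal)  # no access to the future
        x = x + t.reshape(B, N, T, D).transpose(1, 2)
        return x
```

Factoring attention this way keeps cost linear in T for the spatial pass and linear in N for the temporal pass, which is why such layouts are a natural baseline against full joint space-time attention.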
Problem

Research questions and friction points this paper is trying to address.

Autoregressive video prediction of physical simulations using transformers
Improving long-term physical accuracy in video generation models
Enabling interpretable spatiotemporal reasoning without complex latent features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pixel-space transformers for video prediction
Unsupervised training on physical simulations
Interpretable spatiotemporal modeling without latent features (see the probing sketch after this list)
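The probing procedure described above can be sketched as fitting a small supervised readout on frozen intermediate activations, with per-layer probe accuracy indicating where PDE parameter information is encoded. The snippet below is a hypothetical illustration: `collect_activations` and `fit_probe` are invented helper names, and the pooling and probe architecture are assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

def collect_activations(model, layer, clips):
    """Run the frozen model, caching the mean-pooled output of one layer.
    Assumes the hooked layer emits activations shaped (B, tokens, dim)."""
    feats = []
    hook = layer.register_forward_hook(
        lambda m, inp, out: feats.append(out.detach().mean(dim=1))  # pool tokens
    )
    with torch.no_grad():
        model(clips)
    hook.remove()
    return torch.cat(feats, dim=0)      # (num_clips, hidden_dim)

def fit_probe(features, params, epochs=200, lr=1e-2):
    """Fit a linear probe from activations to a scalar PDE parameter
    (e.g., a diffusion coefficient) by minimizing MSE."""
    probe = nn.Linear(features.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(probe(features).squeeze(-1), params)
        loss.backward()
        opt.step()
    return probe
```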
๐Ÿ”Ž Similar Papers
No similar papers found.
Dean L. Slack
Durham University, UK
G. Thomas Hudson
Durham University, UK
T. Winterbottom
Durham University, UK
Noura Al Moubayed
Associate Professor in Machine Learning, Department of Computer Science, Durham University
Machine Learning, NLP, Explainability, Mechanistic Interpretability, ML for Healthcare