TARDIS STRIDE: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the joint spatiotemporal modeling of dynamic real-world environments for embodied agents. To this end, we introduce STRIDE—the first Spatio-Temporal Road Image Dataset explicitly designed for exploration and autonomy—built upon 360° panoramic imagery to capture the coupled spatial and temporal evolution of road scenes. We propose a spatiotemporally coupled, graph-structured road observation representation that unifies multi-view, multi-coordinate-system, and action-space observations. Furthermore, we design TARDIS, a Transformer-based architecture enabling instruction-conditioned, unified spatial-temporal autoregressive world modeling. Evaluated on controllable image synthesis, instruction following, autonomous navigation, and georeferencing, our approach achieves state-of-the-art performance, significantly enhancing embodied agents’ spatiotemporal understanding of physical environments and their capacity for grounded, physics-aware interaction.

📝 Abstract
World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges as they dynamically change across both space and, crucially, time. To capture these composed dynamics, we introduce a Spatio-Temporal Road Image Dataset for Exploration (STRIDE) permuting 360-degree panoramic imagery into rich interconnected observation, state and action nodes. Leveraging this structure, we can simultaneously model the relationship between egocentric views, positional coordinates, and movement commands across both space and time. We benchmark this dataset via TARDIS, a transformer-based generative world model that integrates spatial and temporal dynamics through a unified autoregressive framework trained on STRIDE. We demonstrate robust performance across a range of agentic tasks such as controllable photorealistic image synthesis, instruction following, autonomous self-control, and state-of-the-art georeferencing. These results suggest a promising direction towards sophisticated generalist agents--capable of understanding and manipulating the spatial and temporal aspects of their material environments--with enhanced embodied reasoning capabilities. Training code, datasets, and model checkpoints are made available at https://huggingface.co/datasets/Tera-AI/STRIDE.
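The abstract's "interconnected observation, state and action nodes" can be pictured as a small graph whose edges encode both spatial adjacency and temporal succession. The sketch below is a minimal illustration of that idea; the class names, field names, and relation labels are all assumptions for clarity, not STRIDE's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    kind: str      # "observation" | "state" | "action"
    payload: dict  # e.g. a panorama path, (lat, lon) coordinates, or a command

@dataclass
class RoadGraph:
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)  # node_id -> [(neighbor_id, relation)]

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, [])

    def link(self, src: str, dst: str, relation: str) -> None:
        # relation separates spatial links from temporal succession
        self.edges[src].append((dst, relation))

    def neighbors(self, node_id: str, relation: str) -> list:
        return [d for d, r in self.edges[node_id] if r == relation]

# Example: one panorama grounded at a position, with a forward action
# leading to the panorama at the next time step.
g = RoadGraph()
g.add_node(Node("obs_t0", "observation", {"pano": "0001.jpg", "t": 0}))
g.add_node(Node("state_t0", "state", {"lat": 37.77, "lon": -122.42}))
g.add_node(Node("act_fwd", "action", {"command": "move_forward"}))
g.add_node(Node("obs_t1", "observation", {"pano": "0002.jpg", "t": 1}))
g.link("obs_t0", "state_t0", "grounded_at")
g.link("obs_t0", "act_fwd", "takes")
g.link("obs_t0", "obs_t1", "temporal_next")
```

With this structure, the relationship between egocentric views, positional coordinates, and movement commands is a matter of traversing typed edges rather than aligning separate data streams.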
Problem

Research questions and friction points this paper is trying to address.

Modeling dynamic real-world environments across space and time
Integrating spatial and temporal dynamics for agent behavior
Enhancing embodied reasoning in generalist agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatio-temporal road image dataset for dynamic modeling
Transformer-based generative world model integrating dynamics
Unified autoregressive framework for spatial-temporal tasks
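The "unified autoregressive framework" above amounts to flattening the three node types into a single token stream, so one next-token objective covers observations, states, and actions alike. The sketch below illustrates that interleaving under assumed modality tags and ordering; the actual tokenization used by TARDIS may differ.

```python
def interleave(steps):
    """Flatten per-time-step (obs, state, action) token lists into one
    sequence of (token, time_step) pairs. The modality tags and the
    obs -> state -> action order are illustrative assumptions."""
    sequence = []
    for t, step in enumerate(steps):
        for modality in ("obs", "state", "action"):
            sequence.append(("<%s>" % modality, t))  # modality marker
            sequence.extend((tok, t) for tok in step[modality])
    return sequence

# Two time steps of a trajectory become one autoregressive sequence;
# a transformer predicting the next token then conditions on the full
# spatial and temporal context that precedes it.
steps = [
    {"obs": ["o1", "o2"], "state": ["s1"], "action": ["a1"]},
    {"obs": ["o3", "o4"], "state": ["s2"], "action": ["a2"]},
]
seq = interleave(steps)
```

Treating all modalities as one sequence is what lets a single decoder handle synthesis, instruction following, and control without task-specific heads.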