Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding

📅 2026-03-07
🤖 AI Summary
This work addresses the challenge of high-precision multimodal spatiotemporal modeling in planetary-scale dynamical systems—such as ecosystems—where large-scale annotated data are scarce. We propose DeepEarth, a self-supervised multimodal world model featuring the novel Earth4D encoder, which extends multi-resolution hash embeddings into the temporal dimension to enable global 4D modeling at sub-meter spatial and sub-second temporal resolution across century-scale timeframes. A learnable hash probe is introduced to enhance data efficiency. By integrating multimodal signals—including vision and language—through a masked reconstruction objective, DeepEarth achieves state-of-the-art performance on ecological forecasting benchmarks, substantially outperforming existing large multimodal foundation models. The code and models are publicly released.

📝 Abstract
We present DeepEarth, a self-supervised multi-modal world model with Earth4D, a novel planetary-scale 4D space-time positional encoder. Earth4D extends 3D multi-resolution hash encoding to include time, efficiently scaling across the planet over centuries with sub-meter, sub-second precision. Multi-modal encoders (e.g. vision-language models) are fused with Earth4D embeddings and trained via masked reconstruction. We demonstrate Earth4D's expressive power by achieving state-of-the-art performance on an ecological forecasting benchmark. Earth4D with learnable hash probing surpasses a multi-modal foundation model pre-trained on substantially more data. Access open source code and download models at: https://github.com/legel/deepearth
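The core idea of Earth4D, extending multi-resolution hash encoding from 3D to 4D (x, y, z, t), can be sketched roughly as follows. This is a minimal illustration in the spirit of the abstract, not the paper's implementation: the hash primes follow the common spatial-hashing construction, and the level count, table size, feature width, and nearest-cell lookup are illustrative assumptions (a real encoder would also interpolate between grid vertices and learn the tables end to end).

```python
import numpy as np

# Illustrative primes for a 4D spatial-temporal hash (assumption, not
# the paper's values); the first prime is 1 by common convention.
PRIMES = np.array([1, 2654435761, 805459861, 3674653429], dtype=np.uint64)

def hash_coords(ijkt: np.ndarray, table_size: int) -> np.ndarray:
    """Hash integer 4D grid cells: XOR of coordinate * prime, mod table size."""
    h = np.zeros(ijkt.shape[:-1], dtype=np.uint64)
    for d in range(4):
        h ^= ijkt[..., d].astype(np.uint64) * PRIMES[d]
    return h % np.uint64(table_size)

def encode(xyzt: np.ndarray, tables: list[np.ndarray], base_res: int = 16,
           growth: float = 2.0) -> np.ndarray:
    """Concatenate per-level features looked up at geometrically growing resolutions."""
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)
        cells = np.floor(xyzt * res).astype(np.int64)  # quantize to this level's grid
        idx = hash_coords(cells, table.shape[0])
        feats.append(table[idx])                        # nearest-cell feature lookup
    return np.concatenate(feats, axis=-1)

# Tiny demo: 4 levels, 1024-entry tables, 2 features per entry.
rng = np.random.default_rng(0)
tables = [rng.normal(size=(1024, 2)).astype(np.float32) for _ in range(4)]
points = rng.random((5, 4))  # normalized (x, y, z, t) in [0, 1)
emb = encode(points, tables)
print(emb.shape)  # (5, 8): 4 levels x 2 features each
```

Because finer levels quantize the same point to different cells, the concatenated embedding carries both coarse and fine spatial-temporal context; the hash tables keep memory bounded even as resolution grows.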
Problem

Research questions and friction points this paper is trying to address.

self-supervised, multi-modal, world model, 4D space-time embedding, ecological forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D space-time embedding, self-supervised world model, multi-resolution hash encoding, multi-modal fusion, masked reconstruction
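The masked reconstruction objective listed above can be sketched in miniature. Everything here is an illustrative assumption: the token shapes, the 50% masking pattern, and the stand-in "predictor" (the mean of visible tokens); the paper's model instead predicts masked tokens from cross-modal context with a learned network.

```python
import numpy as np

rng = np.random.default_rng(1)
tokens = rng.normal(size=(16, 32)).astype(np.float32)  # 16 fused tokens, dim 32

mask = np.arange(16) % 2 == 0    # deterministically mask every other token (50%)
corrupted = tokens.copy()
corrupted[mask] = 0.0            # hide masked tokens from the "model"

# Stand-in predictor: reconstruct each masked token as the mean of the
# visible tokens. A real model replaces this with a learned decoder.
prediction = np.broadcast_to(corrupted[~mask].mean(axis=0), tokens[mask].shape)

# The reconstruction loss is computed only on masked positions.
loss = float(np.mean((prediction - tokens[mask]) ** 2))
print(loss)
```

Restricting the loss to masked positions is what makes the objective self-supervised: the targets are the withheld tokens themselves, so no annotation is needed.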
Lance Legel
Ecological Intelligence Lab, Ecodash.ai
Qin Huang
School of Complex Adaptive Systems, Arizona State University
Brandon Voelker
Geosensing Systems Engineering & Sciences Lab, University of Houston
Daniel Neamati
PhD Student, Stanford University
GNSS, Digital Twins, Robotics, Planetary Science, Wildfires
Patrick Alan Johnson
Earth System Lab, Allen Institute for Artificial Intelligence
Favyen Bastani
MIT CSAIL
Jeff Rose
Spatial Intelligence Lab, SpatialLogic.com
James Ryan Hennessy
Department of Computer Science, Georgia Institute of Technology
Robert Guralnick
Curator of Biodiversity Informatics, Florida Museum of Natural History, University of Florida
Macroecology, Global Change Biology, Biodiversity Informatics, Systematics
Douglas Soltis
Florida Museum of Natural History, University of Florida
Pamela Soltis
Florida Museum of Natural History, University of Florida
Shaowen Wang
Professor, University of Illinois Urbana-Champaign
CyberGIS, Geospatial Data Science, Spatial AI, Spatial Analysis, Sustainability