🤖 AI Summary
This work addresses high-precision multimodal spatiotemporal modeling of planetary-scale dynamical systems, such as ecosystems, where large-scale annotated data are scarce. We propose DeepEarth, a self-supervised multimodal world model built on the novel Earth4D encoder, which extends multi-resolution hash embeddings into the temporal dimension to enable global 4D modeling at sub-meter spatial and sub-second temporal resolution across century-scale timeframes. A learnable hash probe further improves data efficiency. By fusing multimodal signals (including vision and language) through a masked-reconstruction objective, DeepEarth achieves state-of-the-art performance on an ecological forecasting benchmark, substantially outperforming a large multimodal foundation model pre-trained on far more data. Code and models are publicly released.
📝 Abstract
We present DeepEarth, a self-supervised multi-modal world model built around Earth4D, a novel planetary-scale 4D space-time positional encoder. Earth4D extends 3D multi-resolution hash encoding to include time, scaling efficiently across the planet over centuries at sub-meter, sub-second precision. Multi-modal encoders (e.g., vision-language models) are fused with Earth4D embeddings and trained via masked reconstruction. We demonstrate Earth4D's expressive power by achieving state-of-the-art performance on an ecological forecasting benchmark: Earth4D with learnable hash probing surpasses a multi-modal foundation model pre-trained on substantially more data. Open-source code and models are available at: https://github.com/legel/deepearth
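To make the core idea concrete, the 4D extension of multi-resolution hash encoding can be sketched as follows. This is a minimal NumPy illustration of an Instant-NGP-style hash grid with a time axis added (per-level feature tables indexed by a spatial hash of hypercube corners, blended by quadrilinear interpolation). All names and hyperparameters here are hypothetical, and the learnable hash probing is omitted; this is not the Earth4D implementation.

```python
import numpy as np

# Large primes for XOR-based spatial hashing (one per dimension: x, y, z, t).
PRIMES = np.array([1, 2654435761, 805459861, 3674653429], dtype=np.uint64)

class HashEncoder4D:
    """Hypothetical multi-resolution 4D hash encoder (illustrative only)."""

    def __init__(self, n_levels=4, table_size=2**14, n_features=2,
                 base_res=4, growth=2.0, seed=0):
        rng = np.random.default_rng(seed)
        self.table_size = table_size
        self.n_features = n_features
        self.resolutions = [int(base_res * growth**l) for l in range(n_levels)]
        # One feature table per level; in practice these are trained by backprop.
        self.tables = [rng.normal(0.0, 1e-2, (table_size, n_features))
                       for _ in range(n_levels)]

    def _hash(self, idx):
        # idx: (..., 4) integer corner coordinates -> bucket in [0, table_size).
        h = np.zeros(idx.shape[:-1], dtype=np.uint64)
        for d in range(4):
            h ^= idx[..., d].astype(np.uint64) * PRIMES[d]
        return h % np.uint64(self.table_size)

    def encode(self, coords):
        # coords: (N, 4) points in [0, 1]^4 -> (N, n_levels * n_features).
        # The 16 corners of a 4D hypercube cell.
        corners = np.array(np.meshgrid(*[[0, 1]] * 4,
                                       indexing="ij")).reshape(4, -1).T
        outs = []
        for res, table in zip(self.resolutions, self.tables):
            x = coords * res
            x0 = np.floor(x).astype(np.int64)   # lower cell corner
            w = x - x0                          # fractional position in cell
            feat = np.zeros((coords.shape[0], self.n_features))
            for c in corners:
                idx = self._hash(x0 + c)
                # Quadrilinear weight for this corner.
                cw = np.prod(np.where(c == 1, w, 1 - w), axis=1)
                feat += cw[:, None] * table[idx]
            outs.append(feat)
        return np.concatenate(outs, axis=1)
```

Coarse levels capture slow planetary trends while fine levels resolve local, fast dynamics; the hash table keeps memory bounded even as resolution grows, which is what makes sub-meter, sub-second precision over centuries tractable.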