🤖 AI Summary
This work addresses high-precision multimodal spatiotemporal modeling of planetary-scale dynamical systems, such as ecosystems, where large-scale annotated data are scarce. We propose DeepEarth, a self-supervised multimodal world model built on the novel Earth4D encoder, which extends multi-resolution hash embeddings into the temporal dimension to enable global 4D modeling at sub-meter spatial and sub-second temporal resolution across century-scale timeframes. A learnable hash probe further improves data efficiency. By fusing multimodal signals (including vision and language) through a masked-reconstruction objective, DeepEarth achieves state-of-the-art performance on an ecological forecasting benchmark, substantially outperforming a large multimodal foundation model pre-trained on far more data. Code and models are publicly released.
📝 Abstract
We present DeepEarth, a self-supervised multi-modal world model built around Earth4D, a novel planetary-scale 4D space-time positional encoder. Earth4D extends 3D multi-resolution hash encoding to include time, scaling efficiently across the planet over centuries at sub-meter, sub-second precision. Multi-modal encoders (e.g., vision-language models) are fused with Earth4D embeddings and trained via masked reconstruction. We demonstrate Earth4D's expressive power by achieving state-of-the-art performance on an ecological forecasting benchmark: Earth4D with learnable hash probing surpasses a multi-modal foundation model pre-trained on substantially more data. Open-source code and models are available at: https://github.com/legel/deepearth
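To make the core idea concrete, the 4D extension of multi-resolution hash encoding can be sketched as follows. This is a minimal NumPy illustration of an Instant-NGP-style hash grid with a time axis added (per-level feature tables indexed by a spatial hash of hypercube corners, blended by quadrilinear interpolation). All names and hyperparameters here are hypothetical, and the learnable hash probing is omitted; this is not the Earth4D implementation.

```python
import numpy as np

# Large primes for XOR-based spatial hashing (one per dimension: x, y, z, t).
PRIMES = np.array([1, 2654435761, 805459861, 3674653429], dtype=np.uint64)

class HashEncoder4D:
    """Hypothetical multi-resolution 4D hash encoder (illustrative only)."""

    def __init__(self, n_levels=4, table_size=2**14, n_features=2,
                 base_res=4, growth=2.0, seed=0):
        rng = np.random.default_rng(seed)
        self.table_size = table_size
        self.n_features = n_features
        self.resolutions = [int(base_res * growth**l) for l in range(n_levels)]
        # One feature table per level; in practice these are trained by backprop.
        self.tables = [rng.normal(0.0, 1e-2, (table_size, n_features))
                       for _ in range(n_levels)]

    def _hash(self, idx):
        # idx: (..., 4) integer corner coordinates -> bucket in [0, table_size).
        h = np.zeros(idx.shape[:-1], dtype=np.uint64)
        for d in range(4):
            h ^= idx[..., d].astype(np.uint64) * PRIMES[d]
        return h % np.uint64(self.table_size)

    def encode(self, coords):
        # coords: (N, 4) points in [0, 1]^4 -> (N, n_levels * n_features).
        # The 16 corners of a 4D hypercube cell.
        corners = np.array(np.meshgrid(*[[0, 1]] * 4,
                                       indexing="ij")).reshape(4, -1).T
        outs = []
        for res, table in zip(self.resolutions, self.tables):
            x = coords * res
            x0 = np.floor(x).astype(np.int64)   # lower cell corner
            w = x - x0                          # fractional position in cell
            feat = np.zeros((coords.shape[0], self.n_features))
            for c in corners:
                idx = self._hash(x0 + c)
                # Quadrilinear weight for this corner.
                cw = np.prod(np.where(c == 1, w, 1 - w), axis=1)
                feat += cw[:, None] * table[idx]
            outs.append(feat)
        return np.concatenate(outs, axis=1)
```

Coarse levels capture slow planetary trends while fine levels resolve local, fast dynamics; the hash table keeps memory bounded even as resolution grows, which is what makes sub-meter, sub-second precision over centuries tractable.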