TerraMind: Large-Scale Generative Multimodality for Earth Observation

📅 2025-04-15
📈 Citations: 1
Influential: 0
🤖 AI Summary
Earth observation (EO) multimodal data are hard to model because a single model must capture both fine-grained spatial detail and high-level semantics. This paper introduces TerraMind, which the authors describe as the first any-to-any generative multimodal foundation model for EO. The model is pretrained with a dual-scale (token-level and pixel-level) early-fusion approach on nine geospatial modalities from a global, large-scale dataset. It also introduces “Thinking-in-Modalities” (TiM), the ability to generate additional artificial data during fine-tuning and inference to improve the model output. The contributions include: (1) open-sourcing the pretraining dataset, the model weights, and the code under a permissive license; and (2) performance beyond the state of the art on community-standard EO benchmarks such as PANGAEA, unifying cross-modal generation, semantic understanding, and spatial reasoning in a single framework and enabling a range of zero-shot and few-shot applications.

📝 Abstract
We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
Problem

Research questions and friction points this paper is trying to address.

Develops first any-to-any generative multimodal model for Earth observation
Combines token-level and pixel-level data for cross-modal learning
Enables zero-shot and few-shot applications with dual-scale fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-scale token-pixel pretraining for EO
Thinking-in-Modalities generates artificial data
Early fusion enables zero-shot EO tasks
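The Thinking-in-Modalities idea can be illustrated with a toy control-flow sketch: generate an extra modality from the available input, fuse it with the original data, and only then predict. Everything below is an assumption made for illustration, not TerraMind's code or API: the names `fuse`, `predict_with_tim`, `fake_landcover`, and `mean_head` are hypothetical, modalities are plain lists of numbers, and "early fusion" is reduced to simple concatenation.

```python
from typing import Callable, Dict, List

Sequence = List[float]  # toy stand-in for a sequence of tokens or embeddings


def fuse(modalities: Dict[str, Sequence]) -> Sequence:
    """Toy early fusion: concatenate all modality sequences
    in a fixed (sorted-by-name) order into one input sequence."""
    fused: Sequence = []
    for name in sorted(modalities):
        fused.extend(modalities[name])
    return fused


def predict_with_tim(
    inputs: Dict[str, Sequence],
    generators: Dict[str, Callable[[Dict[str, Sequence]], Sequence]],
    head: Callable[[Sequence], float],
) -> float:
    """Generate missing modalities from the inputs, then predict
    from the original and generated data together."""
    enriched = dict(inputs)  # copy so the caller's dict is not modified
    for name, generate in generators.items():
        if name not in enriched:  # only synthesize what is missing
            enriched[name] = generate(inputs)
    return head(fuse(enriched))


# Hypothetical stand-ins: a "land cover" generator and a mean-pooling head.
def fake_landcover(inputs: Dict[str, Sequence]) -> Sequence:
    return [round(v) for v in inputs["optical"]]


def mean_head(seq: Sequence) -> float:
    return sum(seq) / len(seq)


optical = {"optical": [0.2, 0.8, 0.6]}
baseline = mean_head(fuse(optical))
with_tim = predict_with_tim(optical, {"landcover": fake_landcover}, mean_head)
```

In this sketch the prediction head sees six values instead of three when the generated modality is included. In a real model, the generator and the head would presumably be parts of the same pretrained network rather than separate functions.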
Johannes Jakubik
Research Scientist @ IBM Research Europe
AI for Climate Impact, Deep learning, Multi-modality
Benedikt Blumenstiel
Research Software Engineer, IBM Research
Computer Vision, Foundation Models, Earth Observation
Erik Scheurer
Forschungszentrum Jülich
Rocco Sedona
Forschungszentrum Jülich
Stefano Maurogiovanni
University of Iceland
Jente Bosmans
European Space Agency Φ-Lab
Nikolaos Dionelis
European Space Agency Φ-Lab
Niklas Kopp
IBM Research – Europe
Rahul Ramachandran
NASA/MSFC
Informatics, Data Science
P. Fraccaro
IBM Research – Europe
Thomas Brunschwiler
IBM Research
Physics & AI for Climate Impact
Gabriele Cavallaro
Forschungszentrum Jülich and University of Iceland
Remote Sensing, Machine Learning, High Performance Computing, Quantum Computing
Juan Bernabe-Moreno
IBM Research – Europe
Nicolas Longépé
European Space Agency Φ-Lab