RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

📅 2025-12-04
🤖 AI Summary
Earth observation (EO) data exhibit significant heterogeneity across modalities and spatial resolutions; existing foundation models are constrained by fixed-input architectures or sensor-specific encoders, limiting their generalization. To address this, we propose a resolution-adjustable multimodal Transformer encoder—the first to treat spatial resolution as a controllable inference-time parameter—enabling flexible adjustment of detail level and providing an explicit trade-off between computational cost and localization accuracy. We further introduce tunable positional encoding and masked reconstruction pretraining to establish a fully sensor-agnostic unified latent space. Evaluated on the PANGAEA benchmark, our method surpasses state-of-the-art approaches despite its smaller model size, demonstrating superior generalization across diverse sensors and resolutions, as well as robust cross-configuration transferability in downstream tasks.

📝 Abstract
Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats modality, spatial resolution, and temporal resolution as key input features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified Transformer encoder to reconstruct masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, which contains various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.
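The abstract's central idea — spatial resolution as a user-controlled output parameter that trades detail against compute — can be illustrated with a toy sketch. This is not RAMEN's actual architecture (the paper's mechanism involves tunable positional encodings inside a Transformer); here a hypothetical `encode_at_resolution` function simply patch-embeds an image and resamples the token grid to the requested output size, so that a coarser request yields fewer tokens and hence cheaper downstream computation:

```python
import numpy as np

def encode_at_resolution(image, patch_size, out_grid):
    """Toy stand-in for a resolution-adjustable encoder.

    Tokens are patch means (a real model would learn a projection),
    then resampled to the requested output grid. Illustrative only;
    RAMEN's actual mechanism differs.
    """
    h, w = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Patch "embedding": average each patch_size x patch_size block.
    tokens = (image[:gh * patch_size, :gw * patch_size]
              .reshape(gh, patch_size, gw, patch_size)
              .mean(axis=(1, 3)))
    # Nearest-neighbour resample of the token grid to the target size:
    # fewer output tokens -> cheaper compute, coarser localization.
    rows = (np.arange(out_grid[0]) * gh / out_grid[0]).astype(int)
    cols = (np.arange(out_grid[1]) * gw / out_grid[1]).astype(int)
    return tokens[np.ix_(rows, cols)]

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
coarse = encode_at_resolution(img, patch_size=8, out_grid=(4, 4))  # 16 tokens
fine = encode_at_resolution(img, patch_size=8, out_grid=(8, 8))    # 64 tokens
print(coarse.shape, fine.shape)  # (4, 4) (8, 8)
```

The same input thus yields a 16-token or a 64-token representation on demand, which is the kind of inference-time knob the paper exposes.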
Problem

Research questions and friction points this paper is trying to address.

Handles variable spatial, spectral, temporal resolutions in Earth observation data
Learns shared sensor-agnostic visual representations across heterogeneous modalities
Enables controllable spatial resolution for detail-computation trade-offs at inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Resolution-adjustable encoder for Earth Observation data
Treats resolution as controllable output parameter
Unified transformer for sensor-agnostic representation learning
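The pretraining objective behind the unified encoder is masked reconstruction: hide a fraction of input tokens and score the model only on the hidden ones. A minimal MAE-style sketch of that objective, with a hypothetical `predict` function standing in for any encoder-decoder (the baseline below just fills every position with the mean of the visible tokens):

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss(tokens, predict, mask_ratio=0.75):
    """Toy masked-reconstruction objective (MAE-style).

    Hide `mask_ratio` of the tokens, let `predict` reconstruct the full
    sequence from the visible ones, and compute MSE on masked positions
    only. Illustrative of the pretraining idea, not RAMEN's exact loss.
    """
    n = tokens.shape[0]
    masked = rng.choice(n, size=int(n * mask_ratio), replace=False)
    visible = np.setdiff1d(np.arange(n), masked)
    recon = predict(tokens, visible)       # model sees visible tokens only
    diff = recon[masked] - tokens[masked]  # scored on masked tokens only
    return float(np.mean(diff ** 2))

# Hypothetical "model": fill every position with the mean visible token.
tokens = rng.normal(size=(16, 4))
baseline = lambda t, vis: np.tile(t[vis].mean(axis=0), (t.shape[0], 1))
loss = masked_reconstruction_loss(tokens, baseline)
```

Because the objective never references a specific sensor, the same loss can be applied to tokens drawn from any modality, which is what makes the pretraining sensor-agnostic.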