PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data

📅 2025-04-26

🤖 AI Summary

Problem: Existing methods struggle to fuse multi-source remote sensing data whose bands differ in spatial resolution and whose band count varies from sensor to sensor. Method: PyViT-FUSE is a foundation model for Earth observation with three components: a scalable multimodal fusion attention mechanism that adaptively integrates mixed-resolution inputs; a pyramid Vision Transformer (ViT) architecture with variable-granularity patching and token aggregation for unified representation learning; and SwAV-based self-supervised pretraining across diverse sensors at global scale. Contributions/Results: (1) First end-to-end framework supporting arbitrary combinations of spectral band count and spatial resolution; (2) Significant performance gains on downstream tasks, including land-cover classification and change detection; (3) Attention visualization confirms interpretable, physically meaningful cross-resolution and cross-sensor feature fusion; (4) State-of-the-art transfer performance on multi-sensor benchmarks.

📝 Abstract

We propose PyViT-FUSE, a foundation model for Earth observation data explicitly designed to handle multi-modal imagery by learning to fuse an arbitrary number of mixed-resolution input bands into a single representation through an attention mechanism. The learned patch tokens are further processed by a stack of vision transformers with a novel pyramidal structure. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwAV algorithm. We show the interpretability of the fusion mechanism by visualizing the attention scores, and we demonstrate the model's applicability to downstream tasks.
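
To make the fusion idea in the abstract concrete, the following is a minimal NumPy sketch of attention-based pooling over bands. It is not the authors' implementation. It assumes each band is split with a patch size chosen so that all bands yield the same token grid (for example, 8x8 pixels for a 10 m band and 4x4 pixels for a 20 m band over the same footprint), and that a single query vector scores each band's token per patch location. The names `patchify`, `fuse_bands`, `query`, and `proj`, the resolutions, and the random weights standing in for learned parameters are all hypothetical.

```python
import numpy as np

def patchify(band, patch):
    """Split a 2-D band (H, W) into flattened non-overlapping patches: (N, patch*patch)."""
    h, w = band.shape
    gh, gw = h // patch, w // patch
    x = band[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return x.transpose(0, 2, 1, 3).reshape(gh * gw, patch * patch)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_bands(bands, patches, dim=16, seed=0):
    """Fuse any number of bands into one token per patch location.

    Assumption: random matrices stand in for learned weights.
    Returns fused tokens (N, dim) and per-band attention weights (N, B).
    """
    rng = np.random.default_rng(seed)
    query = rng.normal(size=dim)                    # hypothetical learned query
    tokens = []
    for band, p in zip(bands, patches):
        proj = rng.normal(size=(p * p, dim)) / p    # hypothetical per-band embedding
        tokens.append(patchify(band, p) @ proj)     # (N, dim)
    tokens = np.stack(tokens, axis=1)               # (N, B, dim)
    scores = tokens @ query / np.sqrt(dim)          # (N, B)
    weights = softmax(scores, axis=1)               # attention over bands
    fused = (weights[..., None] * tokens).sum(axis=1)
    return fused, weights

rng = np.random.default_rng(1)
band_10m = rng.normal(size=(32, 32))  # illustrative higher-resolution band
band_20m = rng.normal(size=(16, 16))  # illustrative lower-resolution band, same footprint
fused, weights = fuse_bands([band_10m, band_20m], patches=[8, 4])
print(fused.shape, weights.shape)
```

Because the softmax runs over the band axis, the `weights` array gives one score per band and patch location, which is the kind of quantity that could be rendered as an attention map when inspecting the fusion.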

Problem

Research questions and friction points this paper is trying to address.

Handles fusion of multi-modal Earth observation imagery

Learns single representation from mixed-resolution input bands

Demonstrates interpretability and downstream task applicability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses multi-modal imagery via an attention mechanism

Uses a pyramidal vision transformer architecture

Trains in a self-supervised manner using core concepts of the SwAV algorithm
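
For readers unfamiliar with SwAV, the sketch below illustrates its two core ingredients in NumPy: balanced soft cluster assignments computed with a few Sinkhorn-Knopp iterations, and a swapped-prediction loss in which the assignment from one view supervises the prediction from the other. This is a generic illustration of SwAV, not the training code of PyViT-FUSE; the source only says the model leverages core concepts of the algorithm. The feature sizes, the number of prototypes, and the values of `eps` and `temp` are assumptions chosen for the example.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(x, axis=1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def sinkhorn(scores, eps=0.05, iters=3):
    """Turn prototype scores (B, K) into soft assignments (B, K).

    Each row sums to 1, and the alternating normalization pushes the
    batch toward using all K prototypes roughly equally.
    """
    q = np.exp((scores - scores.max()) / eps).T   # (K, B)
    q /= q.sum()
    k, b = q.shape
    for _ in range(iters):
        q /= q.sum(axis=1, keepdims=True)         # equalize mass per prototype
        q /= k
        q /= q.sum(axis=0, keepdims=True)         # equalize mass per sample
        q /= b
    return (q * b).T

def swav_loss(z1, z2, prototypes, temp=0.1):
    """Swapped prediction: codes from one view supervise the other view."""
    s1, s2 = z1 @ prototypes.T, z2 @ prototypes.T
    q1, q2 = sinkhorn(s1), sinkhorn(s2)           # treated as fixed targets
    loss_12 = -(q1 * log_softmax(s2 / temp)).sum(axis=1).mean()
    loss_21 = -(q2 * log_softmax(s1 / temp)).sum(axis=1).mean()
    return 0.5 * (loss_12 + loss_21)

rng = np.random.default_rng(0)
z1 = l2_normalize(rng.normal(size=(8, 16)))                # features of view 1
z2 = l2_normalize(z1 + 0.1 * rng.normal(size=(8, 16)))     # features of a perturbed view 2
prototypes = l2_normalize(rng.normal(size=(4, 16)))        # hypothetical cluster prototypes
codes = sinkhorn(z1 @ prototypes.T)
loss = swav_loss(z1, z2, prototypes)
print(codes.shape, float(loss))
```

In an actual training loop the codes would be computed without gradient flow and the prototypes would be learned jointly with the encoder; here everything is fixed so the example stays self-contained.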
