PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data

๐Ÿ“… 2025-04-26
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing methods struggle with fusing heterogeneous-resolution, arbitrary-band multisource remote sensing data. Method: We propose a foundation model for Earth observation featuring a scalable multimodal fusion attention mechanism that adaptively integrates mixed-resolution inputs; a pyramid Vision Transformer (ViT) architecture with variable-granularity patching and token aggregation for unified representation learning; and SwAV-based self-supervised pretraining across diverse sensors at global scale. Contributions/Results: (1) First end-to-end framework supporting arbitrary combinations of spectral band count and spatial resolution; (2) Significant performance gains on downstream tasksโ€”including land-cover classification and change detection; (3) Attention visualization confirms interpretable, physically meaningful cross-resolution and cross-sensor feature fusion; (4) State-of-the-art transfer performance on multi-sensor benchmarks.

Technology Category

Application Category

๐Ÿ“ Abstract
We propose PyViT-FUSE, a foundation model for earth observation data explicitly designed to handle multi-modal imagery by learning to fuse an arbitrary number of mixed-resolution input bands into a single representation through an attention mechanism. The learned patch tokens are further processed by a stack of vision transformers with a novel pyramidal structure. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwAV algorithm. We show the interpretability of the fusion mechanism by visualization of the attention scores and the models applicability to downstream tasks.
Problem

Research questions and friction points this paper is trying to address.

Handles multi-modal earth observation imagery fusion
Learns single representation from mixed-resolution input bands
Demonstrates interpretability and downstream task applicability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses multi-modal imagery via attention mechanism
Uses pyramidal vision transformer architecture
Self-supervised training with SwAV algorithm
๐Ÿ”Ž Similar Papers
No similar papers found.