Temporal-Spatial Tubelet Embedding for Cloud-Robust MSI Reconstruction using MSI-SAR Fusion: A Multi-Head Self-Attention Video Vision Transformer Approach

📅 2025-12-10
🤖 AI Summary
Cloud contamination corrupts the spectral signatures of multispectral imagery (MSI), impeding reliable early-season crop mapping. To address this, we propose a Video Vision Transformer (ViViT)-based framework with temporal-spatial tubelet embedding that reconstructs MSI either from MSI alone or by fusing synthetic aperture radar (SAR) with MSI. Existing Vision Transformer (ViT)-based methods such as SMTS-ViT use coarse temporal embeddings that aggregate entire sequences; our method instead extracts non-overlapping tubelets via 3D convolution with a short temporal span (t=2), preserving local temporal coherence while reducing cross-day information degradation. The framework combines 3D convolutional feature extraction, cross-modal alignment, and multi-head self-attention fusion of MSI and SAR. On 2020 Traill County data, the MSI-only model (MTS-ViViT) reduces mean squared error (MSE) by 2.23% relative to the MTS-ViT baseline, and the SAR-MSI fusion model (SMTS-ViViT) achieves a 10.33% improvement over the SMTS-ViT baseline, improving reconstruction quality for agricultural monitoring under cloud cover.

📝 Abstract
Cloud cover in multispectral imagery (MSI) significantly hinders early-season crop mapping by corrupting spectral information. Existing Vision Transformer (ViT)-based time-series reconstruction methods, such as SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, causing substantial information loss and reducing reconstruction accuracy. To address these limitations, this study proposes a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions. Non-overlapping tubelets are extracted via 3D convolution with a constrained temporal span $(t=2)$, ensuring local temporal coherence while reducing cross-day information degradation. Both MSI-only and synthetic aperture radar (SAR)-MSI fusion scenarios are considered in the experiments. Experiments on 2020 Traill County data show that MTS-ViViT achieves a 2.23% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT, which integrates SAR, achieves a 10.33% improvement over the SMTS-ViT baseline. The proposed framework enhances spectral reconstruction quality for robust agricultural monitoring.
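
To make the tubelet extraction step concrete, here is a minimal NumPy sketch, not the authors' implementation. It relies on the fact that a 3D convolution whose kernel size and stride are both (t, p, p) is equivalent to cutting the series into non-overlapping blocks and applying one linear projection to each. The function name `tubelet_embed`, the spatial patch size `p=4`, the embedding width `D=16`, and the input shape are all illustrative assumptions; only the temporal span t=2 comes from the abstract.

```python
import numpy as np

def tubelet_embed(series, weight, bias, t=2, p=4):
    """Embed a (T, H, W, C) image time series as non-overlapping tubelets.

    Equivalent to a 3D convolution with kernel size == stride == (t, p, p).
    Hypothetical helper: p and the projection weights are assumptions.
    """
    T, H, W, C = series.shape
    if T % t or H % p or W % p:
        raise ValueError("T, H and W must be divisible by the tubelet size")
    nt, nh, nw = T // t, H // p, W // p
    # Split every axis into (block index, offset inside the block).
    x = series.reshape(nt, t, nh, p, nw, p, C)
    # Move block indices to the front and in-block offsets to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per tubelet, flattened to t * p * p * C values.
    tubelets = x.reshape(nt * nh * nw, t * p * p * C)
    # Shared linear projection: one token per tubelet.
    return tubelets @ weight + bias

rng = np.random.default_rng(0)
T, H, W, C, D = 6, 8, 8, 4, 16          # assumed sizes, for illustration only
series = rng.normal(size=(T, H, W, C))   # stand-in for an MSI time series
weight = rng.normal(size=(2 * 4 * 4 * C, D))
bias = np.zeros(D)

tokens = tubelet_embed(series, weight, bias)
print(tokens.shape)  # (12, 16): 3 temporal x 2 x 2 spatial tubelets
```

Because each tubelet spans only two consecutive acquisitions, every token summarizes a short local window rather than the whole sequence, which is the contrast the abstract draws with coarse temporal embeddings.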

Problem

Research questions and friction points this paper is trying to address.

Reconstructs cloud-covered multispectral imagery for crop mapping
Improves temporal embedding to reduce information loss in sequences
Integrates SAR data to enhance reconstruction accuracy in agriculture

Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Vision Transformer with temporal-spatial tubelet embedding
3D convolution extracts non-overlapping tubelets for local coherence
Multi-head self-attention fuses MSI and SAR data
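
The attention-based fusion named above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the paper's architecture: the summary does not say how MSI and SAR tokens are combined, so the sketch simply concatenates the two token sets along the sequence axis and applies standard scaled dot-product multi-head self-attention. The token counts, embedding width, head count, and random weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, wq, wk, wv, wo, num_heads):
    """Standard multi-head self-attention over an (n, d) token matrix."""
    n, d = tokens.shape
    if d % num_heads:
        raise ValueError("embedding width must be divisible by num_heads")
    dh = d // num_heads

    def split(x):
        # (n, d) -> (heads, n, dh)
        return x.reshape(n, num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(tokens @ wq), split(tokens @ wk), split(tokens @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)   # (heads, n, n)
    attn = softmax(scores, axis=-1)
    # Merge heads back to (n, d), then apply the output projection.
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ wo, attn

rng = np.random.default_rng(0)
d, heads = 16, 4                          # assumed sizes
msi_tokens = rng.normal(size=(12, d))     # stand-in MSI tubelet tokens
sar_tokens = rng.normal(size=(12, d))     # stand-in SAR tubelet tokens

# Assumption: fuse by concatenating both modalities into one sequence,
# so every token can attend to tokens from either sensor.
fused_in = np.concatenate([msi_tokens, sar_tokens], axis=0)
wq, wk, wv, wo = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

fused, attn = multi_head_self_attention(fused_in, wq, wk, wv, wo, heads)
print(fused.shape, attn.shape)  # (24, 16) (4, 24, 24)
```

With a joint sequence, a cloud-affected MSI token can draw on SAR tokens from the same location and time window, since SAR is not obscured by cloud; whether the paper uses this joint form or a cross-attention variant is not stated here.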