🤖 AI Summary
Weak generalization and poor cross-dataset adaptability plague remote sensing classification. To address this, we propose SpatialNet-ViT, a Vision Transformer (ViT)-based multi-task learning framework that integrates spatial awareness with contextual understanding. It is the first to deeply couple the ViT architecture with multi-task learning, incorporating a spatial enhancement module and a shared-specialized dual-path decoder to jointly optimize land-use classification, object presence detection, and rural-urban classification. Leveraging data augmentation and transfer-learning strategies, the model significantly improves robustness and generalization. Evaluated on multiple public remote sensing benchmarks, including EuroSAT, UC-Merced, and RSSCN7, our method achieves average classification accuracy gains of 3.2–5.8% over baselines and surpasses state-of-the-art models in cross-domain transfer, demonstrating its effectiveness, stability, and scalability.
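The paper's code is not included here, so the following is a minimal PyTorch sketch of the shared-encoder, multi-head design the summary describes. Everything in it is an assumption for illustration: the `SpatialEnhancement` block, the head names, and all dimensions are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialEnhancement(nn.Module):
    """Illustrative spatial-enhancement block (hypothetical): a depthwise
    conv over the token grid re-injects local spatial structure into
    ViT tokens, followed by a residual connection and LayerNorm."""
    def __init__(self, dim, grid=14):
        super().__init__()
        self.grid = grid
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                 # tokens: (B, N, D), N = grid*grid
        B, N, D = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        x = self.dw(x).flatten(2).transpose(1, 2)
        return self.norm(tokens + x)            # residual + norm

class MultiTaskViT(nn.Module):
    """Shared ViT encoder with three task-specific heads (all names and
    sizes are illustrative, not from the paper)."""
    def __init__(self, dim=384, depth=6, heads=6, grid=14,
                 n_landuse=10, n_objects=5):
        super().__init__()
        self.patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 224 -> 14x14
        self.pos = nn.Parameter(torch.zeros(1, grid * grid, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)          # shared path
        self.spatial = SpatialEnhancement(dim, grid)
        # Task-specific ("specialized") heads on pooled tokens.
        self.landuse_head = nn.Linear(dim, n_landuse)   # multi-class land use
        self.object_head = nn.Linear(dim, n_objects)    # multi-label presence
        self.urban_head = nn.Linear(dim, 1)             # rural vs. urban

    def forward(self, img):                     # img: (B, 3, 224, 224)
        x = self.patch(img).flatten(2).transpose(1, 2) + self.pos
        x = self.spatial(self.encoder(x))        # shared features + spatial cue
        pooled = x.mean(dim=1)                   # global average pooling
        return {"landuse": self.landuse_head(pooled),
                "objects": self.object_head(pooled),   # logits; sigmoid at loss
                "urban": self.urban_head(pooled)}
```

The shared encoder forces all three heads to learn from a common representation, which is the usual mechanism by which multi-task learning acts as a regularizer and improves cross-task generalization.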
📝 Abstract
Remote sensing datasets offer significant promise for key classification tasks such as land-use categorization, object presence detection, and rural/urban classification. However, many existing studies focus on narrow tasks or datasets, which limits their ability to generalize across remote sensing classification challenges. To overcome this, we propose a novel model, SpatialNet-ViT, that leverages the power of Vision Transformers (ViTs) and Multi-Task Learning (MTL). This integrated approach combines spatial awareness with contextual understanding, improving both classification accuracy and scalability. Additionally, techniques such as data augmentation, transfer learning, and multi-task learning are employed to enhance model robustness and its ability to generalize across diverse datasets.
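To make the joint objective concrete, here is a hedged sketch of one training step over the three tasks, reusing the hypothetical `MultiTaskViT` above. The loss weights, augmentation choices, and target formats are illustrative assumptions; the abstract does not specify them.

```python
import torch.nn.functional as F
from torchvision import transforms

# Illustrative augmentation pipeline in the spirit of the paper's
# data-augmentation strategy (the exact transforms are not published).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2),
    transforms.ToTensor(),
])

def multitask_step(model, batch, optimizer, w=(1.0, 1.0, 0.5)):
    """One optimization step over the joint loss. Task weights `w` are
    hypothetical; the paper does not publish its weighting scheme.
    Expects long-typed class labels and float-typed binary targets."""
    img, landuse_y, objects_y, urban_y = batch
    out = model(img)
    loss = (w[0] * F.cross_entropy(out["landuse"], landuse_y)           # multi-class
          + w[1] * F.binary_cross_entropy_with_logits(out["objects"],   # multi-label
                                                      objects_y)
          + w[2] * F.binary_cross_entropy_with_logits(
                       out["urban"].squeeze(1), urban_y))               # binary
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a transfer-learning setup, the encoder would be initialized from a ViT pretrained on a large corpus (e.g., ImageNet) and fine-tuned jointly with the task heads, rather than trained from scratch as this self-contained sketch implies.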