Context-Aware Semantic Segmentation via Stage-Wise Attention

📅 2026-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of semantic segmentation in ultra-high-resolution remote sensing imagery, where Transformers often incur excessive memory costs and struggle to balance global context modeling with fine-grained detail preservation. To this end, we propose CASWiT, a dual-branch Swin architecture comprising a context encoder for capturing long-range dependencies and a detail branch for retaining high-resolution features. Efficient fusion is achieved through stage-wise cross-scale attention and a gated feature injection mechanism. Additionally, we introduce a SimMIM-style masked self-supervised pretraining strategy that jointly reconstructs corresponding regions at both high and low resolutions. Our method achieves 65.83% mIoU on the IGN FLAIR-HUB dataset and 49.1% mIoU on URUR, significantly outperforming existing approaches and establishing a new state of the art on URUR.
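The fusion step described above (cross-scale attention that lets high-resolution tokens attend to context tokens, followed by gated injection) can be sketched as follows. This is a minimal single-head NumPy illustration of the general mechanism, not the paper's implementation; all function and weight names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_fusion(hr_tokens, ctx_tokens, w_q, w_k, w_v, w_g):
    """Sketch of cross-scale attention with gated feature injection.

    hr_tokens:  (N, d) high-resolution (detail-branch) tokens, used as queries
    ctx_tokens: (M, d) downsampled context-branch tokens, used as keys/values
    w_q, w_k, w_v, w_g: (d, d) projection weights (hypothetical, single head)
    """
    q = hr_tokens @ w_q                               # (N, d) queries
    k = ctx_tokens @ w_k                              # (M, d) keys
    v = ctx_tokens @ w_v                              # (M, d) values
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))              # (N, M) attention over context
    ctx = attn @ v                                    # (N, d) context gathered per HR token
    gate = 1.0 / (1.0 + np.exp(-(hr_tokens @ w_g)))   # sigmoid gate from HR features
    return hr_tokens + gate * ctx                     # gated injection of global context
```

The residual form keeps the detail branch intact while the learned gate decides, per token and channel, how much global context to mix in.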

📝 Abstract
Ultra-high-resolution (UHR) semantic segmentation is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models struggle in this setting because memory grows quadratically with token count, constraining either the contextual scope or the spatial resolution. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch, Swin-based architecture that injects global cues into fine-grained UHR features. A context encoder processes a downsampled neighborhood to capture long-range dependencies, while a high-resolution encoder extracts detailed features from UHR patches. A cross-scale fusion module, combining cross-attention and gated feature injection, enriches high-resolution tokens with context. Beyond the architecture, we propose a SimMIM-style pretraining strategy: we mask 75% of the high-resolution image tokens together with the low-resolution center region that spatially corresponds to the UHR patch, then train the shared dual encoder with a small decoder to reconstruct the original UHR image. Extensive experiments on the large-scale IGN FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Our method achieves 65.83% mIoU, outperforming RGB baselines by 1.78 points. On URUR, CASWiT achieves 49.1% mIoU, surpassing the current SoTA by +0.9% under the official evaluation protocol. All code is available at: https://huggingface.co/collections/heig-vd-geo/caswit.
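The masking scheme described in the abstract (randomly masking 75% of the high-resolution patch tokens, plus the low-resolution center region aligned with the UHR patch) can be sketched as below. Grid sizes and the extent of the center region are illustrative assumptions, not values from the paper.

```python
import numpy as np

def build_masks(hr_grid=16, lr_grid=16, center=4, mask_ratio=0.75, rng=None):
    """Boolean token masks for the dual-resolution SimMIM-style pretraining.

    - HR branch: randomly mask `mask_ratio` of the hr_grid x hr_grid patch tokens.
    - LR branch: mask the central `center` x `center` block of the downsampled
      neighborhood, i.e. the region that spatially corresponds to the UHR patch.
    """
    if rng is None:
        rng = np.random.default_rng(0)

    # High-resolution mask: uniform random selection of tokens.
    n = hr_grid * hr_grid
    n_masked = int(n * mask_ratio)
    hr_mask = np.zeros(n, dtype=bool)
    hr_mask[rng.choice(n, size=n_masked, replace=False)] = True

    # Low-resolution mask: centered block covering the UHR patch's footprint.
    lr_mask = np.zeros((lr_grid, lr_grid), dtype=bool)
    lo = (lr_grid - center) // 2
    lr_mask[lo:lo + center, lo:lo + center] = True

    return hr_mask.reshape(hr_grid, hr_grid), lr_mask
```

During pretraining, masked tokens in both branches would be replaced by a learnable mask token, and the reconstruction loss would be computed on the masked UHR pixels only, following the usual SimMIM recipe.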
Problem

Research questions and friction points this paper is trying to address.

Semantic Segmentation
Ultra High Resolution
Remote Sensing
Context Awareness
Transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-Aware Segmentation
Stage-Wise Transformer
Cross-Scale Fusion
UHR Semantic Segmentation
Masked Image Pretraining
Antoine Carreaud
ESO lab. EPFL, 1015 Lausanne, Switzerland; University of Applied Sciences Western Switzerland (HES-SO / HEIG-VD), Yverdon-les-Bains, Switzerland
Elias Naha
ESO lab. EPFL, 1015 Lausanne, Switzerland; University of Applied Sciences Western Switzerland (HES-SO / HEIG-VD), Yverdon-les-Bains, Switzerland
Arthur Chansel
ESO lab. EPFL, 1015 Lausanne, Switzerland; University of Applied Sciences Western Switzerland (HES-SO / HEIG-VD), Yverdon-les-Bains, Switzerland
Nina Lahellec
ESO lab. EPFL, 1015 Lausanne, Switzerland; University of Applied Sciences Western Switzerland (HES-SO / HEIG-VD), Yverdon-les-Bains, Switzerland
Jan Skaloud
Prof. titulaire EPFL
Adrien Gressin
HEIG-VD
Photogrammetry, Remote sensing, Change detection, 3D point cloud registration