Context-Aware Semantic Segmentation via Stage-Wise Attention

📅 2026-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of semantic segmentation in ultra-high-resolution remote sensing imagery, where Transformers often incur excessive memory costs and struggle to balance global context modeling with fine-grained detail preservation. To this end, we propose CASWiT, a dual-branch Swin architecture comprising a context encoder for capturing long-range dependencies and a detail branch for retaining high-resolution features. Efficient fusion is achieved through stage-wise cross-scale attention and a gated feature injection mechanism. Additionally, we introduce a SimMIM-style masked self-supervised pretraining strategy that jointly reconstructs corresponding regions at both high and low resolutions. Our method achieves 65.83% mIoU on the IGN FLAIR-HUB dataset and 49.1% mIoU on URUR, significantly outperforming existing approaches and establishing a new state of the art on URUR.
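The fusion step described above (cross-scale attention that lets high-resolution tokens attend to context tokens, followed by gated injection) can be sketched as follows. This is a minimal single-head NumPy illustration of the general mechanism, not the paper's implementation; all function and weight names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_fusion(hr_tokens, ctx_tokens, w_q, w_k, w_v, w_g):
    """Sketch of cross-scale attention with gated feature injection.

    hr_tokens:  (N, d) high-resolution (detail-branch) tokens, used as queries
    ctx_tokens: (M, d) downsampled context-branch tokens, used as keys/values
    w_q, w_k, w_v, w_g: (d, d) projection weights (hypothetical, single head)
    """
    q = hr_tokens @ w_q                               # (N, d) queries
    k = ctx_tokens @ w_k                              # (M, d) keys
    v = ctx_tokens @ w_v                              # (M, d) values
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))              # (N, M) attention over context
    ctx = attn @ v                                    # (N, d) context gathered per HR token
    gate = 1.0 / (1.0 + np.exp(-(hr_tokens @ w_g)))   # sigmoid gate from HR features
    return hr_tokens + gate * ctx                     # gated injection of global context
```

The residual form keeps the detail branch intact while the learned gate decides, per token and channel, how much global context to mix in.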

📝 Abstract
Ultra-high-resolution (UHR) semantic segmentation is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models struggle in this setting because memory grows quadratically with token count, constraining either the contextual scope or the spatial resolution. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch, Swin-based architecture that injects global cues into fine-grained UHR features. A context encoder processes a downsampled neighborhood to capture long-range dependencies, while a high-resolution encoder extracts detailed features from UHR patches. A cross-scale fusion module, combining cross-attention and gated feature injection, enriches high-resolution tokens with context. Beyond the architecture, we propose a SimMIM-style pretraining strategy: we mask 75% of the high-resolution image tokens together with the low-resolution center region that spatially corresponds to the UHR patch, then train the shared dual encoder with a small decoder to reconstruct the original UHR image. Extensive experiments on the large-scale IGN FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Our method achieves 65.83% mIoU, outperforming RGB baselines by 1.78 points. On URUR, CASWiT achieves 49.1% mIoU, surpassing the current SoTA by +0.9% under the official evaluation protocol. All code is available at: https://huggingface.co/collections/heig-vd-geo/caswit.
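The masking scheme described in the abstract (randomly masking 75% of the high-resolution patch tokens, plus the low-resolution center region aligned with the UHR patch) can be sketched as below. Grid sizes and the extent of the center region are illustrative assumptions, not values from the paper.

```python
import numpy as np

def build_masks(hr_grid=16, lr_grid=16, center=4, mask_ratio=0.75, rng=None):
    """Boolean token masks for the dual-resolution SimMIM-style pretraining.

    - HR branch: randomly mask `mask_ratio` of the hr_grid x hr_grid patch tokens.
    - LR branch: mask the central `center` x `center` block of the downsampled
      neighborhood, i.e. the region that spatially corresponds to the UHR patch.
    """
    if rng is None:
        rng = np.random.default_rng(0)

    # High-resolution mask: uniform random selection of tokens.
    n = hr_grid * hr_grid
    n_masked = int(n * mask_ratio)
    hr_mask = np.zeros(n, dtype=bool)
    hr_mask[rng.choice(n, size=n_masked, replace=False)] = True

    # Low-resolution mask: centered block covering the UHR patch's footprint.
    lr_mask = np.zeros((lr_grid, lr_grid), dtype=bool)
    lo = (lr_grid - center) // 2
    lr_mask[lo:lo + center, lo:lo + center] = True

    return hr_mask.reshape(hr_grid, hr_grid), lr_mask
```

During pretraining, masked tokens in both branches would be replaced by a learnable mask token, and the reconstruction loss would be computed on the masked UHR pixels only, following the usual SimMIM recipe.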
Problem

Research questions and friction points this paper is trying to address.

Semantic Segmentation
Ultra High Resolution
Remote Sensing
Context Awareness
Transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-Aware Segmentation
Stage-Wise Transformer
Cross-Scale Fusion
UHR Semantic Segmentation
Masked Image Pretraining
Antoine Carreaud
ESO lab. EPFL, 1015 Lausanne, Switzerland; University of Applied Sciences Western Switzerland (HES-SO / HEIG-VD), Yverdon-les-Bains, Switzerland
Elias Naha
ESO lab. EPFL, 1015 Lausanne, Switzerland; University of Applied Sciences Western Switzerland (HES-SO / HEIG-VD), Yverdon-les-Bains, Switzerland
Arthur Chansel
ESO lab. EPFL, 1015 Lausanne, Switzerland; University of Applied Sciences Western Switzerland (HES-SO / HEIG-VD), Yverdon-les-Bains, Switzerland
Nina Lahellec
ESO lab. EPFL, 1015 Lausanne, Switzerland; University of Applied Sciences Western Switzerland (HES-SO / HEIG-VD), Yverdon-les-Bains, Switzerland
Jan Skaloud
Prof. titulaire EPFL
Adrien Gressin
HEIG-VD
Photogrammetry, Remote sensing, Change detection, 3D point cloud registration