Multi-encoder ConvNeXt Network with Smooth Attentional Feature Fusion for Multispectral Semantic Segmentation

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited semantic segmentation accuracy in multispectral remote sensing imagery caused by the difficulty of effectively fusing visible and non-visible spectral bands. To this end, the authors propose a dual-branch ConvNeXt encoder–decoder architecture that processes the two spectral modalities separately and integrates them through a multi-scale fusion decoder, combining spatial details with high-level semantic features. The method incorporates a smooth attention mechanism and the ASAU activation function, enabling efficient and flexible spectral feature fusion that supports arbitrary input band configurations. Evaluated on the FBP and Potsdam datasets, the proposed approach significantly outperforms mainstream models such as U-Net, DeepLabV3+, and SegFormer, achieving up to a 19.62% improvement in mIoU. A lightweight variant further reduces computational cost while maintaining competitive performance.

📝 Abstract
This work proposes MeCSAFNet, a multi-branch encoder-decoder architecture for land cover segmentation in multispectral imagery. The model separately processes visible and non-visible channels through dual ConvNeXt encoders, followed by individual decoders that reconstruct spatial information. A dedicated fusion decoder integrates intermediate features at multiple scales, combining fine spatial cues with high-level spectral representations. The feature fusion is further enhanced with CBAM attention, and the ASAU activation function contributes to stable and efficient optimization. The model is designed to process different spectral configurations, including a 4-channel (4c) input combining RGB and NIR bands, as well as a 6-channel (6c) input incorporating NDVI and NDWI indices. Experiments on the Five-Billion-Pixels (FBP) and Potsdam datasets demonstrate significant performance gains. On FBP, MeCSAFNet-base (6c) surpasses U-Net (4c) by +19.21%, U-Net (6c) by +14.72%, SegFormer (4c) by +19.62%, and SegFormer (6c) by +14.74% in mIoU. On Potsdam, MeCSAFNet-large (4c) improves over DeepLabV3+ (4c) by +6.48%, DeepLabV3+ (6c) by +5.85%, SegFormer (4c) by +9.11%, and SegFormer (6c) by +4.80% in mIoU. The model also achieves consistent gains over several recent state-of-the-art approaches. Moreover, compact variants of MeCSAFNet deliver notable performance with lower training time and reduced inference cost, supporting their deployment in resource-constrained environments.
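The dual-branch design described in the abstract (separate encoders for visible and non-visible channels, followed by attention-weighted fusion) can be illustrated with a minimal, framework-free sketch. Note this is not the paper's implementation: the functions below are hypothetical NumPy stand-ins, with a random 1×1 projection in place of each ConvNeXt encoder and a simplified channel-attention gate in place of CBAM.

```python
import numpy as np

def encode(x, out_ch=8, seed=0):
    # Stand-in for a ConvNeXt encoder branch: a fixed random 1x1
    # projection plus ReLU. The real encoders use deep conv stages.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.shape[0], out_ch)) / np.sqrt(x.shape[0])
    return np.maximum(np.einsum('chw,co->ohw', x, w), 0.0)

def channel_attention(f):
    # Simplified CBAM-style channel attention: a sigmoid gate on the
    # globally average-pooled response reweights each channel.
    s = f.mean(axis=(1, 2))          # (C,) per-channel statistics
    w = 1.0 / (1.0 + np.exp(-s))     # sigmoid gate in (0, 1)
    return f * w[:, None, None]

def fuse(vis, nonvis):
    # Concatenate both branches along channels, then reweight.
    return channel_attention(np.concatenate([vis, nonvis], axis=0))

# 6-channel (6c) tile: RGB + NIR + NDVI + NDWI, 32x32 pixels.
x = np.random.default_rng(42).random((6, 32, 32)).astype(np.float32)
vis_feats = encode(x[:3], seed=0)      # visible branch (RGB)
nonvis_feats = encode(x[3:], seed=1)   # non-visible branch
fused = fuse(vis_feats, nonvis_feats)
print(fused.shape)                     # (16, 32, 32)
```

In the actual model, fusion of this kind happens at multiple decoder scales rather than once, which is how fine spatial cues and high-level spectral representations are combined.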
Problem

Research questions and friction points this paper is trying to address.

multispectral semantic segmentation
land cover segmentation
multi-encoder
feature fusion
ConvNeXt
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-encoder
feature fusion
ConvNeXt
attention mechanism
multispectral segmentation
Leo Thomas Ramos
Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, 08193, Spain
Angel D. Sappa
ESPOL Polytechnic University (Ecuador) and Computer Vision Center (Spain)
Image Processing · Computer Vision · 3D Vision · 3D Modeling