AI Summary
Building structures in remote sensing imagery vary widely, and existing segmentation models suffer from gaps between feature-pyramid levels and insufficient global-local feature fusion, leading to ambiguous segmentation boundaries and degraded accuracy. To address these challenges, we propose UAGLNet, an uncertainty-guided end-to-end building extraction framework. Our key contributions are: (1) a cooperative encoder with hybrid CNN and transformer layers that jointly captures local details and long-range dependencies; (2) a Cooperative Interaction Block (CIB) that narrows the gap between local and global features across pyramid levels; (3) a Global-Local Fusion (GLF) module to enhance semantic consistency; and (4) an Uncertainty-Aggregated Decoder (UAD) that estimates pixel-wise uncertainty to guide optimization. Evaluated on multiple benchmark remote sensing datasets, our method achieves state-of-the-art performance, significantly improving boundary sharpness and recall for small-scale buildings. The source code is publicly available.
Abstract
Building extraction from remote sensing images is a challenging task due to the complex structural variations of buildings. Existing methods employ convolutional or self-attention blocks to capture multi-scale features in segmentation models, but the inherent gap in the feature pyramids and insufficient global-local feature integration lead to inaccurate, ambiguous extraction results. To address this issue, we present an Uncertainty-Aggregated Global-Local Fusion Network (UAGLNet), which exploits high-quality global-local visual semantics under the guidance of uncertainty modeling. Specifically, we propose a novel cooperative encoder that adopts hybrid CNN and transformer layers at different stages to capture local and global visual semantics, respectively. An intermediate cooperative interaction block (CIB) is designed to narrow the gap between the local and global features as the network becomes deeper. We then propose a Global-Local Fusion (GLF) module to fuse the global and local representations in a complementary manner. Moreover, to mitigate segmentation ambiguity in uncertain regions, we propose an Uncertainty-Aggregated Decoder (UAD) that explicitly estimates pixel-wise uncertainty to enhance segmentation accuracy. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods. Our code is available at https://github.com/Dstate/UAGLNet
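To make the notion of pixel-wise uncertainty concrete, the following is a minimal sketch of one common proxy: the binary entropy of each pixel's predicted building probability. This is an illustrative assumption, not the UAD formulation used by UAGLNet, which the abstract does not specify. The function names (`binary_entropy`, `uncertainty_map`, `uncertain_pixels`), the toy probability map, and the 0.8 threshold are all hypothetical.

```python
import math


def binary_entropy(p: float, eps: float = 1e-12) -> float:
    """Entropy in bits of a Bernoulli(p) prediction.

    0 means the pixel is predicted with full confidence,
    1 means the prediction is maximally ambiguous (p = 0.5).
    """
    # Clamp to avoid log2(0) at fully confident predictions.
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def uncertainty_map(prob_map):
    """Per-pixel uncertainty for a 2-D grid of building probabilities."""
    return [[binary_entropy(p) for p in row] for row in prob_map]


def uncertain_pixels(prob_map, threshold: float = 0.8):
    """(row, col) coordinates whose uncertainty exceeds the threshold.

    The threshold is a hypothetical value chosen for this toy example.
    """
    u_map = uncertainty_map(prob_map)
    return [
        (r, c)
        for r, row in enumerate(u_map)
        for c, u in enumerate(row)
        if u > threshold
    ]


# Toy 3x3 probability map: confident background on the left,
# confident building on the right, ambiguous boundary in the middle.
probs = [
    [0.02, 0.55, 0.97],
    [0.01, 0.48, 0.99],
    [0.03, 0.60, 0.98],
]

boundary = uncertain_pixels(probs)
print(boundary)
```

In this toy map only the middle column is flagged, which matches the intuition that segmentation ambiguity concentrates along building boundaries. A decoder could, under this assumption, use such a map to weight its refinement or loss toward those pixels.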