UAGLNet: Uncertainty-Aggregated Global-Local Fusion Network with Cooperative CNN-Transformer for Building Extraction

๐Ÿ“… 2025-12-14
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Buildings in remote sensing imagery vary widely in structure, and existing segmentation models suffer from gaps between feature-pyramid levels and insufficient global-local feature fusion, leading to ambiguous boundaries and degraded accuracy. To address these challenges, the authors propose an uncertainty-guided end-to-end building extraction framework. The key contributions are: (1) a hybrid CNN-Transformer encoder that jointly captures local details and long-range dependencies; (2) a cooperative interaction block (CIB) that narrows the gap between local and global features across pyramid levels; (3) a Global-Local Fusion (GLF) module that complementarily fuses the two representations; and (4) an Uncertainty-Aggregated Decoder (UAD) that estimates pixel-wise prediction uncertainty to guide optimization. Evaluated on benchmark remote sensing datasets, the method achieves state-of-the-art performance, improving boundary sharpness and recall for small-scale buildings. The source code is publicly available.

๐Ÿ“ Abstract
Building extraction from remote sensing images is a challenging task due to the complex structural variations of buildings. Existing methods employ convolutional or self-attention blocks to capture multi-scale features in segmentation models, but the inherent gap between feature-pyramid levels and insufficient global-local feature integration lead to inaccurate, ambiguous extraction results. To address this issue, in this paper we present an Uncertainty-Aggregated Global-Local Fusion Network (UAGLNet), which is capable of exploiting high-quality global-local visual semantics under the guidance of uncertainty modeling. Specifically, we propose a novel cooperative encoder, which adopts hybrid CNN and transformer layers at different stages to capture local and global visual semantics, respectively. An intermediate cooperative interaction block (CIB) is designed to narrow the gap between the local and global features as the network becomes deeper. Afterwards, we propose a Global-Local Fusion (GLF) module to complementarily fuse the global and local representations. Moreover, to mitigate segmentation ambiguity in uncertain regions, we propose an Uncertainty-Aggregated Decoder (UAD) to explicitly estimate pixel-wise uncertainty and enhance segmentation accuracy. Extensive experiments demonstrate that our method achieves superior performance to other state-of-the-art methods. Our code is available at https://github.com/Dstate/UAGLNet
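The abstract describes the GLF module as "complementarily" fusing global (transformer) and local (CNN) representations. The paper does not spell out the fusion mechanism here, but a common complementary-fusion pattern is a per-pixel gate that trades off the two branches. The sketch below illustrates that generic pattern only; the function names and the additive gate are illustrative assumptions, not the paper's actual GLF design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_local_fusion(local_feat, global_feat):
    """Gated complementary fusion of local (CNN) and global (transformer)
    feature maps of shape (C, H, W).

    The per-pixel gate here is derived from a simple sum of the two
    branches -- an illustrative stand-in for a learned gating layer.
    Each output element is a convex combination of the two inputs.
    """
    gate = sigmoid(local_feat + global_feat)  # per-pixel weights in (0, 1)
    return gate * local_feat + (1.0 - gate) * global_feat

# Toy example: fuse two random 8-channel feature maps.
rng = np.random.default_rng(0)
local = rng.standard_normal((8, 16, 16))
global_ = rng.standard_normal((8, 16, 16))
fused = global_local_fusion(local, global_)
```

Because the gate lies in (0, 1), every fused value stays between the corresponding local and global values, so neither branch can be entirely discarded at any pixel.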
Problem

Research questions and friction points this paper is trying to address.

Extracts buildings from complex remote sensing images
Integrates global-local features to reduce segmentation ambiguity
Models pixel-wise uncertainty to improve extraction accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid CNN-Transformer encoder captures local and global semantics
Global-Local Fusion module integrates complementary representations
Uncertainty-Aggregated Decoder enhances accuracy via pixel-wise uncertainty
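The UAD is described as explicitly estimating pixel-wise uncertainty to sharpen predictions in ambiguous regions. One standard way to quantify per-pixel uncertainty for a binary building mask is the entropy of the predicted probability, which peaks where the model is least decisive (p ≈ 0.5). The sketch below uses that entropy to reweight a per-pixel loss; this is a plausible reading of "uncertainty-guided" training under assumed names (`pixel_uncertainty`, `uncertainty_weighted_bce`), not the paper's actual decoder or loss.

```python
import numpy as np

def pixel_uncertainty(prob, eps=1e-8):
    """Binary-entropy uncertainty map for a building-probability map
    `prob` in [0, 1]; maximal (ln 2) at prob = 0.5, near zero at 0 or 1."""
    p = np.clip(prob, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def uncertainty_weighted_bce(prob, target, eps=1e-8):
    """Per-pixel binary cross-entropy reweighted by (1 + uncertainty),
    so ambiguous pixels (e.g. building boundaries) contribute more."""
    p = np.clip(prob, eps, 1.0 - eps)
    bce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float(((1.0 + pixel_uncertainty(prob)) * bce).mean())
```

A confident correct pixel (p = 0.9, target = 1) thus incurs a much smaller weighted loss than an ambiguous one (p = 0.5), steering optimization toward the boundary regions the abstract highlights.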
Siyuan Yao
University of Notre Dame
Visualization · Computer Graphics · Computer Vision

Dongxiu Liu
Beijing University of Posts and Telecommunications
Robot Manipulation · Task Planning · Computer Vision

Taotao Li
School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, Shenzhen 518107, China

Shengjie Li
School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China

Wenqi Ren
School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, Shenzhen 518107, China

Xiaochun Cao
Sun Yat-sen University
Computer Vision · Artificial Intelligence · Multimedia · Machine Learning