Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a core difficulty in open-vocabulary remote sensing semantic segmentation: CLIP models struggle to preserve language-aligned semantics and fine-grained spatial structure at the same time. The authors propose DR-Seg, a framework built on the observation that CLIP feature channels are functionally heterogeneous, and decouple those channels into semantics-dominated and structure-dominated subspaces. DINO-derived structural priors guide a graph-based rectification step that produces a refined branch, and an uncertainty-guided adaptive fusion module integrates the predictions of the refined branch and the original CLIP branch. Evaluated on eight remote sensing benchmarks, DR-Seg achieves new state-of-the-art performance, improving boundary accuracy and open-vocabulary generalization while applying structural enhancement only where it is needed, without distorting CLIP's semantics.
📝 Abstract
Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its globally aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, so they fail to delineate boundaries effectively and risk disrupting CLIP's semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for the final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state of the art.
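The abstract does not specify how the uncertainty-guided adaptive fusion is computed. As an illustration of the general idea only, the sketch below fuses two per-pixel class predictions by weighting each branch with its confidence, taken here as one minus the normalized entropy of its softmax output. The entropy-based weighting, the function names (`softmax`, `normalized_entropy`, `fuse_pixel`), and the `eps` parameter are all assumptions made for this example and are not taken from DR-Seg.

```python
import math


def softmax(logits):
    """Convert a list of class logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    k = len(probs)
    if k <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)


def fuse_pixel(clip_logits, refined_logits, eps=1e-8):
    """Fuse two branch predictions for one pixel (hypothetical scheme).

    Each branch is weighted by its confidence, defined here as
    1 - normalized entropy. This is an assumed stand-in for the paper's
    fusion module, whose actual formulation is not given in the abstract.
    """
    p_clip = softmax(clip_logits)
    p_ref = softmax(refined_logits)

    # Clamp at zero to absorb floating-point rounding near uniform inputs.
    c_clip = max(0.0, 1.0 - normalized_entropy(p_clip))
    c_ref = max(0.0, 1.0 - normalized_entropy(p_ref))

    # Weight of the refined branch; eps avoids division by zero
    # when both branches are maximally uncertain.
    w_ref = (c_ref + eps) / (c_clip + c_ref + 2.0 * eps)

    fused = [w_ref * r + (1.0 - w_ref) * c for c, r in zip(p_clip, p_ref)]
    return fused, w_ref


if __name__ == "__main__":
    # The CLIP branch is undecided, the refined branch is confident,
    # so the fused prediction should follow the refined branch.
    fused, w_ref = fuse_pixel([0.0, 0.0, 0.0], [5.0, 0.0, 0.0])
    print("refined-branch weight:", w_ref)
    print("fused probabilities:", fused)
```

In this scheme a branch that is nearly uniform over the classes contributes almost nothing, while two equally confident branches are averaged. A real implementation would apply the same computation to every pixel of dense logit maps, typically with a tensor library rather than Python lists.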
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary segmentation
remote sensing
semantic integrity
structural enhancement
CLIP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouple-and-Rectify
Open-Vocabulary Segmentation
CLIP-DINO Fusion
Structural Enhancement
Remote Sensing