🤖 AI Summary
In domain generalization for semantic segmentation (DGSS), visual cues are highly susceptible to domain shifts, whereas geometric information is far more stable. To exploit this, we propose DepthForge: the first framework to introduce depth-aware learnable tokens that, at each layer, distill domain-invariant visual and spatial information from frozen visual features (DINOv2/EVA02) and frozen depth features (Depth Anything V2). We further design a depth-refinement decoder that adaptively fuses multi-level visual and depth representations. Our approach significantly enhances intra-image geometric consistency and cross-domain robustness. Evaluated on five unseen target domains, DepthForge surpasses state-of-the-art methods, generalizing particularly well under extreme conditions such as nighttime and snowy scenes. Qualitative analysis confirms steadier visual-spatial attention.
📝 Abstract
Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible to domain shifts, whereas the underlying geometry remains stable, making depth information more robust. In this paper, we investigate integrating depth information with VFM features to improve geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates visual cues from frozen DINOv2 or EVA02 with depth cues from frozen Depth Anything V2. In each layer of the VFM, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing the depth awareness and attention of the VFM. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted across various DGSS settings with five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches, with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge excels under extreme conditions (e.g., night and snow). Code is available at https://github.com/anonymouse-xzrptkvyqc/DepthForge.
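To make the token mechanism concrete, below is a minimal PyTorch sketch of the idea of per-layer depth-aware learnable tokens: learnable tokens, conditioned on features from a frozen depth model, are appended to the frozen VFM's patch tokens before the layer's attention. All names, shapes, and the pooling-based conditioning here are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class DepthAwareTokenLayer(nn.Module):
    """Sketch: inject K learnable, depth-conditioned tokens into one ViT layer.

    Hypothetical design — the real DepthForge layer may condition and fuse
    tokens differently.
    """
    def __init__(self, dim: int, num_tokens: int = 8):
        super().__init__()
        # K learnable tokens shared across the batch
        self.tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        nn.init.trunc_normal_(self.tokens, std=0.02)
        # projection that maps frozen depth features into the token space
        self.depth_proj = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # visual: (B, N, C) frozen VFM patch tokens (DINOv2/EVA02)
        # depth:  (B, M, C) frozen depth-model tokens (Depth Anything V2)
        B = visual.shape[0]
        # condition the learnable tokens on pooled depth information
        depth_ctx = self.depth_proj(depth.mean(dim=1, keepdim=True))  # (B, 1, C)
        tokens = self.tokens.expand(B, -1, -1) + depth_ctx            # (B, K, C)
        # appended tokens then participate in the frozen layer's attention
        return torch.cat([visual, tokens], dim=1)                     # (B, N+K, C)

layer = DepthAwareTokenLayer(dim=64, num_tokens=8)
out = layer(torch.randn(2, 196, 64), torch.randn(2, 196, 64))
print(out.shape)  # torch.Size([2, 204, 64])
```

Because only the tokens and the small projection are trainable, the frozen VFM and depth backbone keep their pretrained representations while the tokens learn to carry domain-invariant spatial cues through each layer.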