🤖 AI Summary
Existing open-set domain generalization (OSDG) methods struggle to jointly optimize structural risk and open-space risk, and in particular yield overconfident predictions for "hard unknown" classes with fine-grained visual similarity to known categories. To address this, we propose a semantic-enhanced OSDG framework. First, we introduce semantic-aware prompt learning to explicitly encode semantic priors of known classes. Second, we design a duplex contrastive learning mechanism that jointly optimizes the decision boundary via "known-known cohesion" and "known-unknown separation." Third, we leverage a CLIP-guided semantic diffusion model to synthesize high-fidelity pseudo-unknown samples, thereby strengthening hard-negative learning. Evaluated on five standard benchmarks, our method achieves average improvements of +3.0% in accuracy and +5.0% in H-score over state-of-the-art approaches. It is the first OSDG framework to achieve decoupled modeling of known and unknown risks under fine-grained semantic guidance.
📝 Abstract
Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models such as CLIP, existing methods still face a dilemma between the structural risk of known classes and the open-space risk of unknown classes, and easily suffer from over-confidence, especially when distinguishing "hard unknowns" that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module that decomposes images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives: repulsion, to maintain separability from known classes, and cohesion, to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing the extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% in accuracy and 5% in H-score over state-of-the-art methods.
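To make the duplex contrastive idea concrete, here is a minimal toy sketch of how a repulsion term (keep the unknown prompt separable from known-class prompts) and a cohesion term (keep it within semantic proximity of the known classes) might be combined. The function name, the margin-based hinge formulation, and the margin values are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two sets of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def duplex_contrastive_loss(unknown_prompt, known_prompts,
                            repel_margin=0.5, cohere_margin=0.1):
    """Toy duplex objective (hypothetical, margins are illustrative).

    repulsion: penalize the unknown prompt for being too similar
               to ANY known-class prompt (separability).
    cohesion:  penalize the unknown prompt if even its NEAREST
               known class is too dissimilar (semantic proximity).
    """
    sims = cosine_sim(unknown_prompt[None, :], known_prompts)[0]
    # Hinge on similarity above the repulsion margin, averaged over classes.
    repulsion = np.maximum(0.0, sims - repel_margin).mean()
    # Hinge on the nearest known class falling below the cohesion margin.
    cohesion = np.maximum(0.0, cohere_margin - sims.max())
    return repulsion + cohesion
```

With these margins, an unknown prompt that collapses onto a known-class prompt incurs a repulsion penalty, while one that drifts far from all known classes incurs a cohesion penalty, which is the tension the abstract describes.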