Generalized Category Discovery under Domain Shifts: From Vision to Vision-Language Models

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses generalized category discovery (GCD) in the challenging setting where unlabeled data exhibit both domain shift and semantic shift. The study presents the first systematic investigation of this scenario and introduces three adaptation frameworks under a unified design principle. HiLo disentangles domain and semantic features through multi-level feature extraction and mutual information minimization. HLPrompt extends HiLo with semantic-aware spatial prompt tuning to suppress background and domain-related noise. VLPrompt integrates vision-language models with factorized textual prompts and cross-modal consistency regularization to improve generalization. Evaluated on both synthetically corrupted and real-world multi-domain datasets, the proposed methods consistently improve over strong baselines, and because they operate on different foundation backbones they suit different deployment scenarios.
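The summary names cross-modal consistency regularization but does not give its form. As a rough illustration only, the sketch below assumes the regularizer penalizes disagreement between class probabilities predicted from the visual branch and from the textual-prompt branch, using a symmetric KL divergence. The function names, the choice of symmetric KL, and the example logits are all assumptions made for illustration, not the paper's definition.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(visual_logits, text_logits):
    """Hypothetical consistency term: symmetric KL between the two branches.

    Assumption: both branches score the same set of candidate classes.
    """
    p = softmax(visual_logits)
    q = softmax(text_logits)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Illustrative logits for one image over three candidate classes.
agree = consistency_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
disagree = consistency_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

Under this assumed form, the term is zero when the two branches agree and grows as their predictions diverge, so minimizing it pulls the visual and textual predictions toward each other.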

📝 Abstract

Generalized Category Discovery (GCD) aims to categorize unlabelled instances from both known and unknown classes by transferring knowledge from labelled data of known classes. Existing methods assume all data comes from a single domain, yet real-world unlabelled data often exhibits domain shifts alongside semantic shifts. We study GCD under domain shifts and propose three frameworks that adapt foundation models, ranging from self-supervised vision models to vision-language models. (i) HiLo disentangles domain and semantic features through multi-level feature extraction and mutual information minimization, combined with PatchMix augmentation and curriculum sampling. (ii) HLPrompt extends HiLo with semantic-aware spatial prompt tuning to suppress background and domain noise. (iii) VLPrompt leverages vision-language models via factorized textual prompts and cross-modal consistency regularization. The three methods share core design principles while operating on different foundation backbones, making them suitable for different deployment scenarios. Extensive experiments on synthetic corruptions and real-world multi-domain shifts demonstrate consistent improvements over strong baselines. Project page: https://visual-ai.github.io/hilo/
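The abstract mentions PatchMix augmentation without describing it. As a generic illustration of patch-level mixing, the sketch below swaps square patches of one image for the corresponding patches of another. It operates on raw 2D pixel grids; whether the paper mixes pixels or patch embeddings, how it picks the mixing ratio, and how labels are handled are not stated here, so `patch_mix`, `mix_prob`, and the returned keep ratio are hypothetical choices.

```python
import random

def patch_mix(img_a, img_b, patch, mix_prob, rng):
    """Replace square patches of img_a with the matching patches of img_b.

    img_a, img_b: equally sized 2D lists (H x W) of pixel values.
    patch: side length of a patch; H and W must be divisible by it.
    mix_prob: probability that a given patch is taken from img_b.
    Returns the mixed image and the fraction of patches kept from img_a.
    """
    h, w = len(img_a), len(img_a[0])
    if h != len(img_b) or w != len(img_b[0]):
        raise ValueError("images must have the same shape")
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")

    mixed = [row[:] for row in img_a]  # copy so the inputs stay untouched
    n_patches = (h // patch) * (w // patch)
    swapped = 0
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            if rng.random() < mix_prob:
                swapped += 1
                for r in range(top, top + patch):
                    mixed[r][left:left + patch] = img_b[r][left:left + patch]
    return mixed, 1 - swapped / n_patches

# Illustrative use: mix an all-zeros image with an all-ones image.
rng = random.Random(0)
zeros = [[0] * 4 for _ in range(4)]
ones = [[1] * 4 for _ in range(4)]
mixed, keep_ratio = patch_mix(zeros, ones, patch=2, mix_prob=0.5, rng=rng)
```

The keep ratio is returned because mixing-based augmentations commonly weight the training target by how much of each source survives; whether the paper does so is an assumption.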

Problem

Research questions and friction points this paper is trying to address.

Generalized Category Discovery

Domain Shifts

Unlabelled Data

Semantic Shifts

Vision-Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized Category Discovery

Domain Shift

Vision-Language Models