OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism

📅 2026-04-16
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses two limitations of existing generalized category discovery (GCD) methods, namely their reliance on single-modality inputs and on dataset-specific fine-tuning, by introducing a modality-agnostic, zero-shot GCD method. The approach uses modality-specific encoders to extract features, then applies dimension reduction to construct a unified GCD latent space, decoupling representation learning from category discovery. A Transformer-based model trained once on synthetic data transforms this latent space at test time into a representation better suited for clustering, so no dataset-specific fine-tuning is needed. The framework is evaluated on four modalities: vision, text, audio, and remote sensing. Across 16 datasets spanning these modalities, it improves classification accuracy for known and novel classes over baselines by an average of 6.2, 17.9, 1.5, and 12.7 percentage points for vision, text, audio, and remote sensing, respectively.
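The summary reports accuracy for known and novel classes. As a rough illustration of how such a score can be computed from cluster assignments, the sketch below maps clusters to classes with the best one-to-one assignment and then reports accuracy over all, known-class, and novel-class samples. This is an assumption about the evaluation protocol, not something stated in the source; the function name `clustering_accuracy` and the toy labels are hypothetical, and the brute-force search stands in for the Hungarian algorithm that larger evaluations would typically use.

```python
from itertools import permutations


def clustering_accuracy(y_true, y_pred, known_classes):
    """Return (all, known, novel) accuracy under the best cluster-to-class mapping.

    Hypothetical helper for illustration; not taken from the OmniGCD code base.
    """
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")

    # Brute-force search over one-to-one mappings. Only practical for a
    # handful of classes; it is a stand-in for Hungarian matching.
    best_map, best_hits = {}, -1
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
        if hits > best_hits:
            best_map, best_hits = mapping, hits

    def accuracy(keep):
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if keep(t)]
        if not pairs:
            return float("nan")
        return sum(best_map[p] == t for t, p in pairs) / len(pairs)

    known = set(known_classes)
    return (
        accuracy(lambda t: True),
        accuracy(lambda t: t in known),
        accuracy(lambda t: t not in known),
    )


# Toy example: "cat" and "dog" are known classes, "fox" is novel.
y_true = ["cat", "cat", "dog", "dog", "fox", "fox"]
y_pred = [0, 0, 1, 1, 1, 2]
all_acc, known_acc, novel_acc = clustering_accuracy(y_true, y_pred, {"cat", "dog"})
```

In this toy case one novel sample falls into the "dog" cluster, so the known-class accuracy stays at 1.0 while the novel-class accuracy drops to 0.5.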

📝 Abstract
Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our OmniGCD leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a GCD latent space, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a zero-shot GCD setting where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. Trained once on synthetic data, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of +6.2, +17.9, +1.5 and +12.7 for vision, text, audio and remote sensing). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD. Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery. All code is available at https://github.com/Jordan-HS/OmniGCD.
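The four stages named in the abstract (encode, reduce, transform, cluster) can be pictured as a pipeline of swappable parts. The sketch below is only a structural illustration under stated assumptions: `ZeroShotGCDPipeline`, `truncate`, `identity`, and `seeded_kmeans` are hypothetical names, coordinate truncation stands in for the unspecified dimension reduction, the identity function stands in for the pretrained Transformer, and a simple label-seeded k-means stands in for the clustering step. None of it reproduces the released OmniGCD code.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional

Vector = List[float]


@dataclass
class ZeroShotGCDPipeline:
    # All four stages are hypothetical callables, not the paper's API.
    encode: Callable[[object], Vector]               # frozen modality-specific encoder
    reduce: Callable[[List[Vector]], List[Vector]]   # dimension reduction into the latent space
    refine: Callable[[List[Vector]], List[Vector]]   # stand-in for the pretrained Transformer
    cluster: Callable[[List[Vector], List[Optional[int]], int], List[int]]

    def run(self, inputs, labels, num_classes):
        feats = [self.encode(x) for x in inputs]
        latent = self.reduce(feats)
        refined = self.refine(latent)
        return self.cluster(refined, labels, num_classes)


def truncate(dim):
    """Toy dimension reduction: keep the first `dim` coordinates."""
    return lambda feats: [f[:dim] for f in feats]


def identity(feats):
    """Placeholder for the test-time transformation; leaves points unchanged."""
    return feats


def _mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def seeded_kmeans(points, labels, k, iters=10):
    """k-means seeded with labelled class means.

    Assumes labelled class ids are 0..m-1 with m >= 1, and None marks
    unlabelled points. Labelled points always keep their given class.
    """
    known = sorted({l for l in labels if l is not None})
    centroids = [_mean([p for p, l in zip(points, labels) if l == c]) for c in known]
    # Seed each remaining centroid with the point farthest from current centroids.
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    assign = []
    for _ in range(iters):
        assign = [
            labels[i] if labels[i] is not None
            else min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for i, p in enumerate(points)
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = _mean(members)
    return assign


# Toy data: the third coordinate is constant and is dropped by `truncate(2)`.
inputs = [(0, 0, 9), (0, 1, 9), (5, 5, 9), (5, 6, 9),
          (0.5, 0.5, 9), (5, 5.5, 9), (10, 0, 9), (10, 1, 9)]
labels = [0, 0, 1, 1, None, None, None, None]
pipeline = ZeroShotGCDPipeline(encode=list, reduce=truncate(2),
                               refine=identity, cluster=seeded_kmeans)
pred = pipeline.run(inputs, labels, num_classes=3)
```

Because each stage is a separate callable, a different encoder can be plugged in without touching the clustering step, which is the decoupling of representation learning from category discovery that the abstract emphasizes.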
Problem

Research questions and friction points this paper is trying to address.

Generalized Category Discovery
Modality Agnosticism
Zero-shot Learning
Cross-modal
Category Discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

modality-agnostic
Generalized Category Discovery
zero-shot GCD
synthetic training
latent space transformation