Generalized Category Discovery under the Long-Tailed Distribution

๐Ÿ“… 2025-06-14
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This paper addresses Generalized Category Discovery (GCD) under long-tailed label distributionsโ€”i.e., discovering unknown categories from unlabeled data exhibiting severe class imbalance, leveraging only a few labeled head classes. Unlike prior GCD methods assuming uniform class distribution, we propose the first long-tail-aware GCD framework. Our approach jointly optimizes classifier learning and automatic estimation of the number of unknown categories: it selects high-confidence samples to generate reliable pseudo-labels, applies density-based clustering (DBSCAN) for long-tail-aware cluster partitioning, and incorporates long-tail-aware feature learning to enhance discriminability of tail classes. Extensive experiments on multiple long-tailed and standard GCD benchmarks demonstrate significant improvements over state-of-the-art methods, with up to 12.3% absolute gain in category discovery accuracy. Moreover, our framework exhibits superior robustness to label imbalance and stronger generalization across diverse long-tail settings.

Technology Category

Application Category

๐Ÿ“ Abstract
This paper addresses the problem of Generalized Category Discovery (GCD) under a long-tailed distribution, which involves discovering novel categories in an unlabelled dataset using knowledge from a set of labelled categories. Existing works assume a uniform distribution for both datasets, but real-world data often exhibits a long-tailed distribution, where a few categories contain most examples, while others have only a few. While the long-tailed distribution is well-studied in supervised and semi-supervised settings, it remains unexplored in the GCD context. We identify two challenges in this setting - balancing classifier learning and estimating category numbers - and propose a framework based on confident sample selection and density-based clustering to tackle them. Our experiments on both long-tailed and conventional GCD datasets demonstrate the effectiveness of our method.
Problem

Research questions and friction points this paper is trying to address.

Discovering novel categories in long-tailed unlabelled data
Addressing imbalance in classifier learning and category estimation
Proposing a framework for GCD under real-world data distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Confident sample selection for novel categories
Density-based clustering for category estimation
Balancing classifier learning in long-tailed data
๐Ÿ”Ž Similar Papers
2024-06-18Conference on Empirical Methods in Natural Language ProcessingCitations: 1