A Survey of Text Classification Under Class Distribution Shift

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper surveys generalization failure in text classification caused by class distribution shift, particularly when the set of classes changes between training and testing. Through a literature review and formal modeling, it unifies three mainstream problem formulations (open-set learning, zero-shot learning, and learning with the Universum), showing their shared modeling principles and the distinct constraints each one assumes. It identifies continual learning as a promising way to address many of the issues caused by a shifting class distribution. The contributions include: (1) a structured taxonomy of methods; (2) several future research directions; and (3) a maintained, curated list of relevant papers on open-set text classification, hosted on GitHub.

📝 Abstract
The basic underlying assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. However, in daily practice, this assumption is often broken, i.e., the distribution of the test data changes over time, which hinders the application of conventional ML models. One domain where the distribution shift naturally occurs is text classification, since people always find new topics to discuss. To this end, we survey research articles studying open-set text classification and related tasks. We divide the methods in this area based on the constraints that define the kind of distribution shift and the corresponding problem formulation, i.e., learning with the Universum, zero-shot learning, and open-set learning. We next discuss the predominant mitigation approaches for each problem setup. Finally, we identify several future work directions, aiming to push the boundaries beyond the state of the art. Interestingly, we find that continual learning can solve many of the issues caused by the shifting class distribution. We maintain a list of relevant papers at https://github.com/Eduard6421/Open-Set-Survey.
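
To make the open-set setting concrete, the sketch below shows one simple and widely used baseline: rejecting an input as "unknown" when the classifier's highest softmax probability falls below a threshold. This is only an illustration and is not taken from the survey. The function names, the class labels, and the threshold value of 0.7 are all assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_open_set(logits, known_classes, threshold=0.7, unknown_label="unknown"):
    """Return a known class, or `unknown_label` when confidence is too low.

    `threshold` is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return unknown_label
    return known_classes[best]


# Hypothetical topic labels seen during training.
classes = ["sports", "politics", "technology"]

# A confident prediction is mapped to a known class.
confident = predict_open_set([5.0, 0.0, 0.0], classes)

# Near-uniform scores suggest a topic outside the training classes.
uncertain = predict_open_set([0.1, 0.0, 0.05], classes)
```

A closed-set classifier would always return one of the three training topics. The threshold adds the option to abstain, which is the basic behavior an open-set method needs when new topics appear at test time.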
Problem

Research questions and friction points this paper is trying to address.

Text classification under distribution shift
Mitigation approaches for shifting data
Continual learning for shifting class distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Survey of text classification
Open-set learning methods
Continual learning solutions
Adriana Valentina Costache
University of Bucharest, Bucharest, Romania
Silviu Florin Gheorghe
University of Bucharest, Bucharest, Romania
Eduard Poesina
University of Bucharest, Bucharest, Romania
Paul Irofti
Associate Professor, University of Bucharest
anomaly detection, CyberAI, security, dictionary learning, operating systems
R. Ionescu
University of Bucharest, Bucharest, Romania