Re-Evaluating the Impact of Unseen-Class Unlabeled Data on Semi-Supervised Learning Model

📅 2025-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical confounding-variable problem in conventional semi-supervised learning (SSL) evaluation: the uncontrolled presence of unseen-class unlabeled data. To address this, we propose the first controllable evaluation paradigm, wherein—while fixing the proportion of seen-class samples—we independently manipulate five key dimensions of unseen-class data: number of classes, sample count, semantic distance from seen classes, distributional shift, and label noise. We systematically analyze their impact on both global and local robustness. Extensive experiments across state-of-the-art SSL methods (e.g., FixMatch, UDA) on CIFAR-10/100 demonstrate that unseen-class data is not inherently detrimental; rather, its judicious incorporation can enhance model generalization and robustness, yielding up to a 2.8% absolute accuracy gain. These findings fundamentally challenge the prevailing assumption that unseen classes inevitably degrade SSL performance.

📝 Abstract
Semi-supervised learning (SSL) effectively leverages unlabeled data and has proven successful across various fields. Current safe SSL methods assume that unseen classes in unlabeled data harm the performance of SSL models. However, previous methods for assessing the impact of unseen classes on SSL model performance are flawed: they fix the size of the unlabeled dataset and adjust the proportion of unseen classes within it. This procedure violates the principle of controlled variables. Increasing the proportion of unseen classes in the unlabeled data simultaneously decreases the proportion of seen classes, so degraded classification performance on seen classes may stem not from the additional unseen-class samples but from the reduced number of seen-class samples. Thus, the prior assessment standard that "unseen classes in unlabeled data can damage SSL model performance" may not always hold. This paper strictly adheres to the principle of controlled variables, maintaining the proportion of seen classes in unlabeled data while varying only the unseen classes across five critical dimensions, to investigate their impact on SSL models in terms of global and local robustness. Experiments demonstrate that unseen classes in unlabeled data do not necessarily impair SSL model performance; under certain conditions, they may even enhance it.
Problem

Research questions and friction points this paper is trying to address.

Re-evaluates impact of unseen-class unlabeled data on SSL models.
Challenges flawed methods assessing unseen classes' effect on SSL performance.
Investigates unseen classes' influence on SSL models under controlled conditions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controls seen class proportion in unlabeled data
Varies unseen classes across five dimensions
Assesses impact on SSL model robustness
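The controlled-variable protocol above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the pool construction, class split (labels 0-5 "seen", 6-9 "unseen"), and sample counts are all assumed for the example. It contrasts the flawed fixed-total-size protocol, where raising the unseen share necessarily shrinks the seen share, with the controlled protocol, where the seen-class count is held fixed and only the unseen-class count varies.

```python
import random

def build_unlabeled_set(seen_pool, unseen_pool, n_seen, n_unseen, seed=0):
    """Construct an unlabeled set from (x, y) pools, drawing exactly
    n_seen seen-class samples and n_unseen unseen-class samples."""
    rng = random.Random(seed)
    return rng.sample(seen_pool, n_seen) + rng.sample(unseen_pool, n_unseen)

# Toy pools: labels 0-5 are "seen", 6-9 are "unseen" (assumed split).
seen_pool = [(i, i % 6) for i in range(600)]
unseen_pool = [(i, 6 + i % 4) for i in range(400)]

# Flawed protocol: total size fixed at 500, so raising the unseen share
# from 10% to 30% also cuts the number of seen-class samples (450 -> 350).
flawed_a = build_unlabeled_set(seen_pool, unseen_pool, n_seen=450, n_unseen=50)
flawed_b = build_unlabeled_set(seen_pool, unseen_pool, n_seen=350, n_unseen=150)

# Controlled protocol: seen-class count held at 400 in both settings;
# only the unseen-class count changes, isolating its effect.
ctrl_a = build_unlabeled_set(seen_pool, unseen_pool, n_seen=400, n_unseen=50)
ctrl_b = build_unlabeled_set(seen_pool, unseen_pool, n_seen=400, n_unseen=150)
```

Under the controlled protocol, any performance difference between `ctrl_a` and `ctrl_b` is attributable to the unseen-class samples alone; the other four dimensions studied in the paper (number of unseen classes, semantic distance, distributional shift, label noise) would be varied the same way, one at a time.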
Rundong He
The Hong Kong Polytechnic University
Trustworthy Machine Learning
Yicong Dong
School of Software, Shandong University
Lanzhe Guo
School of Intelligence Science and Technology, Nanjing University
Yilong Yin
School of Software, Shandong University
Tailin Wu
Assistant professor, Westlake University; previously postdoc@Stanford CS, PhD at MIT
AI for scientific simulation and design; AI for scientific discovery; representation learning