ScarceGAN: Discriminative Classification Framework for Rare Class Identification for Longitudinal Data with Weak Prior

📅 2021-10-26

🏛️ International Conference on Information and Knowledge Management

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

This paper addresses the identification of extremely rare positive-class instances (e.g., rare attacks, risky players) in multi-dimensional longitudinal telemetry data, where positive samples are severely scarce, negative samples are multi-class and only weakly labelled, and most of the data is unlabelled, leaving a tiny and weak prior on every class. ScarceGAN reformulates the semi-supervised GAN to model the available positives jointly with weakly labelled multi-class negatives. It introduces a 'leeway' term that relaxes the supervised discriminator's constraint on exactly separating negative samples with a noisy prior, and it modifies the cost objectives of the discriminator (supervised and unsupervised paths) and of the generator. For identifying risky players in skill gaming, the approach reaches a recall of over 85% on the scarce class, a roughly 60% jump over a vanilla semi-supervised GAN. On the KDDCUP99 intrusion dataset, it sets a new benchmark for identifying a rare attack class that makes up only 0.09% of the data.

📝 Abstract

This paper introduces ScarceGAN, which focuses on identifying extremely rare or scarce samples from multi-dimensional longitudinal telemetry data with a small and weak label prior. We specifically address: (i) severe scarcity in the positive class, stemming from both the underlying organic skew in the data and extremely limited labels; (ii) the multi-class nature of the negative samples, with uneven density distributions and partially overlapping feature distributions; and (iii) massively unlabelled data, leading to a tiny and weak prior on both the positive and negative classes and the possibility of unseen or unknown behavior in the unlabelled set, especially in the negative class. Although this is related to PU learning problems, we contend that knowledge (or lack of it) about the negative class can be leveraged to better learn its complement (i.e., the positive class) in a semi-supervised manner. To this end, ScarceGAN reformulates the semi-supervised GAN by accommodating weakly labelled multi-class negative samples and the available positive samples. It relaxes the supervised discriminator's constraint on exact differentiation between negative samples by introducing a 'leeway' term for samples with a noisy prior. We propose modifications to the cost objectives of the discriminator, in both the supervised and unsupervised paths, as well as that of the generator. For identifying risky players in skill gaming, this formulation as a whole gives a recall of over 85% (a ~60% jump over a vanilla semi-supervised GAN) on our scarce class with very minimal verbosity in the unknown space. Further, ScarceGAN outperforms the recall benchmarks established by recent GAN-based specialized models for positive imbalanced class identification and establishes a new benchmark in identifying one of the rare attack classes (0.09%) in the intrusion dataset from the KDDCUP99 challenge. We establish ScarceGAN as a new competitive benchmark framework for rare class identification in longitudinal telemetry data.
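The abstract does not give the loss formulas, so the following is only a rough sketch of how a relaxed semi-supervised GAN discriminator objective could look. It assumes the standard semi-supervised GAN setup in which the discriminator emits one logit per real class and the generated ("fake") class has an implicit logit of zero. The extra "unknown" class, the `weak_mask` flag, and the particular form of the leeway (counting probability mass on either the given label or the unknown class as correct) are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def logsumexp(a, axis=-1):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def softplus(x):
    # log(1 + exp(x)) without overflow.
    return np.logaddexp(0.0, x)

def supervised_loss(logits, labels, weak_mask, unknown_idx):
    """Cross-entropy with a hypothetical 'leeway' for weakly labelled samples.

    logits:      (n, k) discriminator logits over the k real classes
    labels:      (n,) integer class labels, assumed never equal to unknown_idx
    weak_mask:   (n,) bool, True where the label comes from a noisy prior
    unknown_idx: index of the assumed 'unknown' class
    """
    log_p = logits - logsumexp(logits)[:, None]        # log-softmax
    rows = np.arange(len(labels))
    log_p_label = log_p[rows, labels]
    # Assumption: a weak sample is not penalised for landing in 'unknown'.
    log_p_relaxed = np.logaddexp(log_p_label, log_p[:, unknown_idx])
    per_sample = np.where(weak_mask, log_p_relaxed, log_p_label)
    return -per_sample.mean()

def unsupervised_loss(real_logits, fake_logits):
    """Real-versus-generated loss with the fake logit fixed at zero.

    D(x) = Z / (Z + 1) with Z = sum_k exp(logit_k), so
    log D(x) = lse - softplus(lse) and log(1 - D(x)) = -softplus(lse).
    """
    lse_real = logsumexp(real_logits)
    lse_fake = logsumexp(fake_logits)
    log_d_real = lse_real - softplus(lse_real)
    log_one_minus_d_fake = -softplus(lse_fake)
    return -(log_d_real.mean() + log_one_minus_d_fake.mean())

# Toy usage: class 0 = positive, classes 1-2 = negatives, class 3 = unknown.
logits = np.zeros((2, 4))
labels = np.array([0, 1])
weak = np.array([False, True])
relaxed = supervised_loss(logits, labels, weak, unknown_idx=3)
strict = supervised_loss(logits, labels, np.array([False, False]), unknown_idx=3)
```

In this toy setup the relaxed loss is lower than the strict one, because the weakly labelled negative is credited for mass placed on the unknown class. A full discriminator objective would combine the supervised and unsupervised terms; how ScarceGAN weights them, and how it changes the generator objective, is not stated in the abstract.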

Problem

Research questions and friction points this paper is trying to address.

Identify rare samples from multi-dimensional longitudinal data with weak labels

Handle severe positive-class scarcity alongside multi-class negative samples with uneven, partially overlapping distributions

Leverage massively unlabelled data, with only a tiny and weak label prior, for semi-supervised rare-class detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-supervised GAN for rare class identification

Accommodation of weakly labelled multi-class negative samples via a 'leeway' term for noisy priors

Modified cost objectives for discriminator and generator
