ScarceGAN: Discriminative Classification Framework for Rare Class Identification for Longitudinal Data with Weak Prior

📅 2021-10-26
🏛️ International Conference on Information and Knowledge Management
📈 Citations: 3
✨ Influential: 0
🤖 AI Summary
This paper addresses the challenge of identifying extremely rare positive-class instances (e.g., novel attacks, high-risk users) in longitudinal telemetry data, where positive samples are severely scarce, negative samples are multi-class and weakly labeled, and most of the data is unlabeled, leaving only a small and weak prior on both classes. It proposes ScarceGAN, a reformulation of the semi-supervised GAN that jointly models the scarce positives and the weakly labeled multi-class negatives. The method introduces a leeway term that relaxes the supervised discriminator's constraint on exactly differentiating between negative samples with a noisy prior, and modifies the cost objectives of the discriminator (in both its supervised and unsupervised paths) and of the generator. On a task of identifying risky players in skill gaming, the approach achieves over 85% recall on the rare class, a jump of roughly 60% over a vanilla semi-supervised GAN. On the KDDCUP99 intrusion dataset, it sets a new benchmark in identifying a rare attack class that makes up only 0.09% of the data.

📝 Abstract
This paper introduces ScarceGAN, which focuses on the identification of extremely rare or scarce samples from multi-dimensional longitudinal telemetry data with a small and weak label prior. We specifically address: (i) severe scarcity in the positive class, stemming from both the underlying organic skew in the data and extremely limited labels; (ii) the multi-class nature of the negative samples, with uneven density distributions and partially overlapping feature distributions; and (iii) massively unlabelled data, leading to a tiny and weak prior on both positive and negative classes and the possibility of unseen or unknown behavior in the unlabelled set, especially in the negative class. Although the problem is related to PU learning, we contend that knowledge (or lack of it) of the negative class can be leveraged to better learn its complement (i.e., the positive class) in a semi-supervised manner. To this effect, ScarceGAN reformulates the semi-supervised GAN by accommodating weakly labelled multi-class negative samples and the available positive samples. It relaxes the supervised discriminator's constraint on exact differentiation between negative samples by introducing a 'leeway' term for samples with a noisy prior. We propose modifications to the cost objectives of the discriminator, in both the supervised and unsupervised paths, as well as that of the generator. For identifying risky players in skill gaming, this formulation as a whole gives us a recall of over 85% (~60% jump over a vanilla semi-supervised GAN) on our scarce class with very minimal verbosity in the unknown space. Further, ScarceGAN outperforms the recall benchmarks established by recent GAN-based specialized models for positive imbalanced class identification and establishes a new benchmark in identifying one of the rare attack classes (0.09%) in the intrusion dataset from the KDDCUP99 challenge.
We establish ScarceGAN as one of the new competitive benchmark frameworks for rare class identification in longitudinal telemetry data.
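
The abstract describes the 'leeway' term only in words, so the following is a minimal sketch of one way such a relaxation could look, not the paper's actual objective. It assumes a discriminator that emits one logit per class (one rare positive class, several negative classes, and one "generated" class, as in a standard semi-supervised GAN). For a negative sample flagged as having a noisy prior, it scores the total probability mass on all negative classes instead of the exact labelled class. The class layout, the function names `softmax` and `supervised_loss`, and the `noisy_prior` flag are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def supervised_loss(logits, label, negative_classes, noisy_prior=False):
    """Cross-entropy for one labelled sample.

    Assumption (not from the paper): when a negative sample has a noisy
    prior, any negative class counts as correct, so the loss uses the
    summed probability of all negative classes.
    """
    probs = softmax(logits)
    if noisy_prior and label in negative_classes:
        return -math.log(sum(probs[k] for k in negative_classes))
    return -math.log(probs[label])


# Assumed layout: 0 = rare positive, 1..3 = negative classes, 4 = generated.
NEGATIVES = {1, 2, 3}
logits = [0.1, 2.0, 1.8, 0.2, -1.0]

strict = supervised_loss(logits, 2, NEGATIVES)
relaxed = supervised_loss(logits, 2, NEGATIVES, noisy_prior=True)
print(f"strict={strict:.3f} relaxed={relaxed:.3f}")
```

In this sketch the relaxed loss can never exceed the strict one, because the summed negative mass always includes the probability of the labelled class. Positive samples are unaffected by the flag, so the rare class is still scored exactly.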

Problem

Research questions and friction points this paper is trying to address.

Identify rare samples in multi-dimensional longitudinal data with a small and weak label prior
Address severe scarcity of positive samples alongside multi-class negative samples with uneven, partially overlapping distributions
Leverage unlabeled data to improve rare class detection in a semi-supervised manner

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-supervised GAN for rare class identification
Accommodation of weakly labelled multi-class negative samples
Modified cost objectives for discriminator and generator
Surajit Chakrabarty
Artificial Intelligence and Data Science, Games24x7, India
Rukma Talwadker
Artificial Intelligence and Data Science, Games24x7, India
Tridib Mukherjee
Games24x7
Artificial Intelligence, Outcome-based Interactive Platforms, Behavior Modeling & Personalization, Game Intelligence & Informat