AI Summary
This paper addresses the identification of extremely rare positive-class instances (e.g., novel attacks, high-risk users) in longitudinal telemetry data under three constraints: positive samples are severely scarce, negative samples are multi-class and only weakly labeled, and most of the data is unlabeled, which leaves only a small, weak prior on both classes. It proposes ScarceGAN, a reformulation of the semi-supervised GAN that accommodates weakly labeled multi-class negatives alongside the few available positives. The method adds a leeway term that relaxes the supervised discriminator's requirement to separate negative samples exactly when their labels are noisy, and it modifies the cost objectives of the discriminator (on both its supervised and unsupervised paths) and of the generator. On a risk-control task for skill gaming, the approach achieves over 85% recall on the rare class, roughly a 60% jump over a vanilla semi-supervised GAN. On the KDDCUP99 intrusion dataset, it sets a new recall benchmark for one of the rare attack classes, which makes up only 0.09% of the data.
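Recall is the headline metric here because accuracy says very little at this level of imbalance. The following minimal sketch uses made-up labels (9 positives in 10,000 samples, chosen only to match the 0.09% prevalence mentioned above) to show that a classifier which never predicts the rare class still scores 99.91% accuracy while its recall is zero. The helper names `recall` and `accuracy` are illustrative and not taken from the paper.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the predictions recover."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Fraction of all samples labelled correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Synthetic illustration only: 9 positives among 10,000 samples (0.09%).
y_true = [1] * 9 + [0] * 9991

# A trivial classifier that always predicts the negative class.
always_negative = [0] * len(y_true)

acc_trivial = accuracy(y_true, always_negative)  # 9991 / 10000
rec_trivial = recall(y_true, always_negative)    # no positives recovered
```
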
Abstract
This paper introduces ScarceGAN, which focuses on identifying extremely rare or scarce samples in multi-dimensional longitudinal telemetry data with a small and weak label prior. We specifically address: (i) severe scarcity in the positive class, stemming from both the underlying organic skew in the data and extremely limited labels; (ii) the multi-class nature of the negative samples, with uneven density distributions and partially overlapping feature distributions; and (iii) massively unlabelled data, leading to a tiny and weak prior on both positive and negative classes and to the possibility of unseen or unknown behavior in the unlabelled set, especially in the negative class. Although the problem is related to PU (positive-unlabelled) learning, we contend that knowledge (or lack of it) about the negative class can be leveraged to better learn its complement (i.e., the positive class) in a semi-supervised manner. To this end, ScarceGAN reformulates the semi-supervised GAN to accommodate weakly labelled multi-class negative samples together with the available positive samples. It relaxes the supervised discriminator's constraint on exact differentiation between negative samples by introducing a 'leeway' term for samples with a noisy prior. We propose modifications to the cost objectives of the discriminator, in both its supervised and unsupervised paths, as well as to that of the generator. For identifying risky players in skill gaming, this formulation as a whole gives a recall of over 85% (a ~60% jump over vanilla semi-supervised GAN) on our scarce class, with very minimal verbosity in the unknown space. Further, ScarceGAN outperforms the recall benchmarks established by recent GAN-based specialized models for positive imbalanced class identification and establishes a new benchmark in identifying one of the rare attack classes (0.09%) in the intrusion dataset from the KDDCUP99 challenge.
We establish ScarceGAN as one of the new competitive benchmark frameworks for rare class identification in longitudinal telemetry data.
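The abstract does not give the form of the 'leeway' term, so the sketch below is only one plausible reading and not the paper's objective. It assumes the supervised discriminator outputs logits over a hypothetical class layout (three negative sub-classes, one leeway class, one positive class, one generated class), and that a weakly labelled negative sample is penalised only for probability mass that falls outside its labelled class and the leeway class. The function names, the class layout, and the example logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def strict_loss(logits, label):
    """Standard cross-entropy: all mass must land on the labelled class."""
    return -math.log(softmax(logits)[label])


def leeway_loss(logits, label, leeway_index):
    """Relaxed loss (assumed form): mass on the labelled class or on the
    leeway class is accepted for samples whose label is noisy."""
    p = softmax(logits)
    return -math.log(p[label] + p[leeway_index])


# Hypothetical layout: 0-2 negative sub-classes, 3 leeway, 4 positive, 5 generated.
LEEWAY = 3

# Made-up logits for a sample weakly labelled as negative sub-class 1,
# where the discriminator is unsure between sub-class 1 and the leeway class.
logits = [0.2, 1.0, 0.1, 1.2, -1.0, -2.0]

loss_strict = strict_loss(logits, label=1)
loss_relaxed = leeway_loss(logits, label=1, leeway_index=LEEWAY)
```

Under this assumed form the relaxed loss is never larger than the strict one, so a noisy negative label pushes the discriminator less hard toward an exact sub-class assignment, while positives and generated samples would still use the strict loss.
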