🤖 AI Summary
This paper reveals a pervasive label memorization phenomenon in binary classification models, which threatens both generalization and the privacy of training data. To address this, the authors propose two passive black-box label inference attacks (BLIAs) that identify memorized labels solely from model outputs, such as confidence scores or log-loss values, without requiring internal model access, gradient information, or interactive queries, thereby establishing a "passive label inference" paradigm. Using controlled canary label-flipping experiments, they demonstrate that label memorization persists under standard training and even under Label Differential Privacy (Label-DP) protection. Comparative analysis with randomized response mechanisms further shows that existing Label-DP schemes fail to effectively suppress this phenomenon. Evaluated across diverse models, the BLIAs achieve an average attack success rate exceeding 50%, substantially outperforming the random-guessing baseline. These results confirm the ubiquity of label memorization and present a substantive challenge to current differential privacy frameworks.
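The randomized response mechanism referenced above is the standard way to provide Label-DP for binary labels: each training label is kept with probability e^ε / (1 + e^ε) and flipped otherwise. A minimal sketch (function name hypothetical, not from the paper):

```python
import math
import random

def randomized_response(label: int, epsilon: float) -> int:
    """Label-DP via binary randomized response.

    Keeps the true label with probability e^eps / (1 + e^eps),
    otherwise flips it. Smaller epsilon -> more flips -> stronger
    formal privacy (but, per the paper, memorization can persist).
    """
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return label if random.random() < p_keep else 1 - label
```

For ε = 2 the flip probability is 1 / (1 + e²) ≈ 0.12, so roughly one label in eight is randomized before training.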
📝 Abstract
Model memorization has implications for both the generalization capacity of machine learning models and the privacy of their training data. This paper investigates label memorization in binary classification models through two novel passive black-box label inference attacks (BLIAs). These attacks operate passively, relying solely on the outputs of pre-trained models, such as confidence scores and log-loss values, without interacting with or modifying the training process. By intentionally flipping 50% of the labels in controlled subsets, termed "canaries," we evaluate the extent of label memorization under two conditions: models trained without label differential privacy (Label-DP) and those trained with randomized response-based Label-DP. Despite the application of varying degrees of Label-DP, the proposed attacks consistently achieve success rates exceeding 50%, surpassing the baseline of random guessing and demonstrating that models memorize training labels, even when these labels are deliberately uncorrelated with the features.
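The confidence-score variant of such an attack can be sketched in a few lines: for each canary sample, guess that the training label is whichever class the model assigns the highest probability, since a memorizing model leans toward the (possibly flipped) label it was trained on. This is a minimal illustration under that assumption, not the paper's exact procedure; all names are hypothetical:

```python
import numpy as np

def infer_training_labels(probs: np.ndarray) -> np.ndarray:
    """Passive confidence-score attack sketch.

    probs: (n, 2) array of class probabilities from the target model.
    Returns the inferred training label per sample: the class the
    model is most confident in. No queries or gradients needed,
    only the model's published outputs.
    """
    return np.argmax(probs, axis=1)

def attack_success_rate(inferred: np.ndarray, training_labels: np.ndarray) -> float:
    """Fraction of canaries whose (flipped) training label is recovered.

    Because canary labels are flipped at random, random guessing
    yields ~0.5; anything consistently above that indicates label
    memorization.
    """
    return float(np.mean(inferred == training_labels))
```

A log-loss variant works the same way, scoring each candidate label by its negative log-probability and picking the lower-loss one, which coincides with the argmax rule in the binary case.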