🤖 AI Summary
This paper reveals a pervasive label memorization phenomenon in binary classification models, which threatens both generalization and the privacy of training data. To address this, the authors propose two passive black-box label inference attacks (BLIAs) that identify memorized labels solely from model outputs, such as confidence scores or log-loss values, without requiring internal model access, gradient information, or interactive queries, thereby establishing a "passive label inference" paradigm. Using controlled canary label-flipping experiments, they demonstrate that label memorization persists under standard training and even under Label Differential Privacy (Label-DP) protection. Comparative analysis with randomized response mechanisms further shows that existing Label-DP schemes fail to effectively suppress this phenomenon. Evaluated across diverse models, the BLIAs achieve an average attack success rate exceeding 50%, substantially outperforming the random-guessing baseline. These results confirm the ubiquity of label memorization and present a substantive challenge to current differential privacy frameworks.
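The randomized response mechanism referenced above is the standard way to provide Label-DP for binary labels: each training label is kept with probability e^ε / (1 + e^ε) and flipped otherwise. A minimal sketch (function name hypothetical, not from the paper):

```python
import math
import random

def randomized_response(label: int, epsilon: float) -> int:
    """Label-DP via binary randomized response.

    Keeps the true label with probability e^eps / (1 + e^eps),
    otherwise flips it. Smaller epsilon -> more flips -> stronger
    formal privacy (but, per the paper, memorization can persist).
    """
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return label if random.random() < p_keep else 1 - label
```

For ε = 2 the flip probability is 1 / (1 + e²) ≈ 0.12, so roughly one label in eight is randomized before training.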
📝 Abstract
Model memorization has implications for both the generalization capacity of machine learning models and the privacy of their training data. This paper investigates label memorization in binary classification models through two novel passive black-box label inference attacks (BLIAs). These attacks operate passively, relying solely on the outputs of pre-trained models, such as confidence scores and log-loss values, without interacting with or modifying the training process. By intentionally flipping 50% of the labels in controlled subsets, termed "canaries," we evaluate the extent of label memorization under two conditions: models trained without label differential privacy (Label-DP) and those trained with randomized response-based Label-DP. Despite the application of varying degrees of Label-DP, the proposed attacks consistently achieve success rates exceeding 50%, surpassing the baseline of random guessing and demonstrating that models memorize training labels, even when these labels are deliberately uncorrelated with the features.
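The confidence-score variant of such an attack can be sketched in a few lines: for each canary sample, guess that the training label is whichever class the model assigns the highest probability, since a memorizing model leans toward the (possibly flipped) label it was trained on. This is a minimal illustration under that assumption, not the paper's exact procedure; all names are hypothetical:

```python
import numpy as np

def infer_training_labels(probs: np.ndarray) -> np.ndarray:
    """Passive confidence-score attack sketch.

    probs: (n, 2) array of class probabilities from the target model.
    Returns the inferred training label per sample: the class the
    model is most confident in. No queries or gradients needed,
    only the model's published outputs.
    """
    return np.argmax(probs, axis=1)

def attack_success_rate(inferred: np.ndarray, training_labels: np.ndarray) -> float:
    """Fraction of canaries whose (flipped) training label is recovered.

    Because canary labels are flipped at random, random guessing
    yields ~0.5; anything consistently above that indicates label
    memorization.
    """
    return float(np.mean(inferred == training_labels))
```

A log-loss variant works the same way, scoring each candidate label by its negative log-probability and picking the lower-loss one, which coincides with the argmax rule in the binary case.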