🤖 AI Summary
This work addresses the limited effectiveness of existing membership inference attacks (MIAs) on large language models when members and non-members are drawn from the same distribution, which hinders accurate assessment of training-data privacy risk. The authors propose a white-box MIA that perturbs inputs via a single targeted gradient-ascent step and measures the resulting shifts in internal model representations, such as logits and hidden-layer activations. The attack is the first to leverage gradient-induced feature drift as its core discriminative signal, revealing systematic differences between memorized and non-memorized samples in gradient geometry and representational stability, and thereby establishing a mechanistic link between memorization and privacy risk. By combining directional projection of representation changes with a lightweight logistic-regression classifier, the method significantly outperforms existing attacks based on confidence, perplexity, or reference models across multiple Transformer architectures and real-world MIA benchmarks.
📝 Abstract
Large language models (LLMs) are trained on massive web-scale corpora, raising growing concerns about privacy and copyright. Membership inference attacks (MIAs) aim to determine whether a given example was used during training. Existing LLM MIAs largely rely on output probabilities or loss values and often perform only marginally better than random guessing when members and non-members are drawn from the same distribution. We introduce G-Drift MIA, a white-box membership inference method based on gradient-induced feature drift. Given a candidate (x,y), we apply a single targeted gradient-ascent step that increases its loss and measure the resulting changes in internal representations, including logits, hidden-layer activations, and projections onto fixed feature directions, before and after the update. These drift signals are used to train a lightweight logistic classifier that effectively separates members from non-members. Across multiple transformer-based LLMs and datasets derived from realistic MIA benchmarks, G-Drift substantially outperforms confidence-based, perplexity-based, and reference-based attacks. We further show that memorized training samples systematically exhibit smaller and more structured feature drift than non-members, providing a mechanistic link between gradient geometry, representation stability, and memorization. Overall, our results demonstrate that small, controlled gradient interventions offer a practical tool for auditing training-data membership and assessing privacy risks in LLMs.
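The core measurement described in the abstract can be sketched on a toy multinomial-logistic model in plain Python. This is an illustrative stand-in, not the paper's implementation: the linear model, the weights, and the step size `eta` are hypothetical, and a real attack would operate on an LLM's parameters and hidden activations instead.

```python
import math

def logits(w, x):
    # Toy linear "model": one logit per class from weight rows w.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def nll_grad(w, x, y):
    # Gradient of the negative log-likelihood of class y w.r.t. the weights.
    p = softmax(logits(w, x))
    return [[(p[c] - (1.0 if c == y else 0.0)) * xi for xi in x]
            for c in range(len(w))]

def drift_features(w, x, y, eta=0.1):
    # One targeted gradient-ASCENT step (w + eta * grad increases the loss),
    # then compare logits before and after: the "feature drift" signal.
    z0 = logits(w, x)
    g = nll_grad(w, x, y)
    w_pert = [[wi + eta * gi for wi, gi in zip(wr, gr)]
              for wr, gr in zip(w, g)]
    z1 = logits(w_pert, x)
    delta = [a - b for a, b in zip(z1, z0)]
    return math.sqrt(sum(d * d for d in delta)), delta
```

Per the paper's claim, drift features like this scalar magnitude (and directional projections of `delta`) would then be fed to a logistic-regression classifier; memorized members are reported to show smaller, more structured drift than non-members.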