G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited effectiveness of existing membership inference attacks (MIAs) on large language models when members and non-members are drawn from the same distribution, which hinders accurate assessment of training data privacy risks. The authors propose a white-box MIA that perturbs inputs via single-step targeted gradient ascent and measures resulting shifts in internal model representations—such as logits and hidden-layer activations. For the first time, gradient-induced feature drift is leveraged as the core discriminative signal, revealing systematic differences between memorized and non-memorized samples in terms of gradient geometry and representational stability, thereby establishing a mechanistic link between memorization and privacy risk. By combining directional projection of representation changes with a lightweight logistic regression classifier, the method significantly outperforms existing attacks based on confidence, perplexity, or reference models across multiple Transformer architectures and real-world MIA benchmarks.
📝 Abstract
Large language models (LLMs) are trained on massive web-scale corpora, raising growing concerns about privacy and copyright. Membership inference attacks (MIAs) aim to determine whether a given example was used during training. Existing LLM MIAs largely rely on output probabilities or loss values and often perform only marginally better than random guessing when members and non-members are drawn from the same distribution. We introduce G-Drift MIA, a white-box membership inference method based on gradient-induced feature drift. Given a candidate (x,y), we apply a single targeted gradient-ascent step that increases its loss, then compare internal representations before and after the update, including logits, hidden-layer activations, and projections onto fixed feature directions. These drift signals are used to train a lightweight logistic classifier that effectively separates members from non-members. Across multiple transformer-based LLMs and datasets derived from realistic MIA benchmarks, G-Drift substantially outperforms confidence-based, perplexity-based, and reference-based attacks. We further show that memorized training samples systematically exhibit smaller and more structured feature drift than non-members, providing a mechanistic link between gradient geometry, representation stability, and memorization. Overall, our results demonstrate that small, controlled gradient interventions offer a practical tool for auditing training-data membership and assessing privacy risks in LLMs.
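The drift-feature idea in the abstract can be illustrated with a toy NumPy stand-in. This is a hypothetical sketch, not the paper's implementation: instead of an LLM, it uses a small linear softmax model whose cross-entropy gradient is known in closed form, takes one gradient-ascent step on the loss of a candidate (x, y), and measures how the logits move. The function name `drift_features` and the three features chosen (drift magnitude, target-logit shift, cosine alignment with the original logits) are illustrative assumptions; the real method also reads hidden-layer activations and projections onto fixed feature directions, which this toy model does not have.

```python
import numpy as np

def drift_features(W, x, y, lr=0.01):
    """Toy stand-in for G-Drift's drift measurement (illustrative only).

    W  : (num_classes, dim) weight matrix of a linear softmax model
    x  : (dim,) input vector
    y  : target class index of the candidate example
    lr : step size of the single gradient-ascent step
    """
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    z0 = W @ x                      # logits before the intervention
    p = softmax(z0)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0

    # Closed-form gradient of cross-entropy loss wrt W: (p - onehot) x^T.
    grad = np.outer(p - onehot, x)
    W_up = W + lr * grad            # ascent step: deliberately increases the loss

    z1 = W_up @ x                   # logits after the intervention
    drift = z1 - z0                 # gradient-induced logit drift

    cos = drift @ z0 / (np.linalg.norm(drift) * np.linalg.norm(z0) + 1e-12)
    return np.array([
        np.linalg.norm(drift),      # overall drift magnitude
        drift[y],                   # shift of the target logit (negative under ascent)
        cos,                        # directional projection onto the original logits
    ])
```

In the paper's setting, such per-example feature vectors for known members and non-members would then be fed to a lightweight logistic regression classifier; the claimed signal is that memorized samples show smaller, more structured drift than non-members.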
Problem

Research questions and friction points this paper is trying to address.

Membership Inference
Large Language Models
Privacy
Feature Drift
Training Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Membership Inference Attack
Gradient-Induced Feature Drift
Large Language Models
White-Box Auditing
Representation Stability
Ravi Ranjan
Florida International University
Utkarsh Grover
University of South Florida
Xiaomin Lin
Assistant Professor, University of South Florida
AI for good · Robotics for science · Robotics for good
Agoritsa Polyzou
Florida International University