From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of copyright disputes and benchmark contamination in pre-training data detection for large language models by proposing Gradient Deviation Scoring (GDS). GDS leverages gradient dynamics during training—specifically, the evolution of sample familiarity—to perform membership inference based on characteristics such as the magnitude and position of gradient updates and the concentration of neuron activations. By constructing lightweight binary classifiers from gradient profiles of Feed-Forward Network (FFN) and Attention modules, the method achieves state-of-the-art performance across five public datasets, significantly outperforming existing baselines. Moreover, GDS demonstrates superior cross-dataset generalization and enhanced interpretability, offering a principled and effective approach to identifying whether a given sample was part of a model's training data.

📝 Abstract
Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on likelihood-based statistical features or on heuristic signals before and after fine-tuning, but the former are susceptible to word-frequency bias in corpora, and the latter depend strongly on the similarity of the fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses reveal distributional differences in gradient features, enabling practical and scalable pre-training data detection.
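The pipeline described in the abstract — per-sample gradient profiles (update magnitude, location, concentration) fed into a lightweight binary classifier — can be sketched on a toy model. This is a minimal illustration under stated assumptions: the linear "model", the squared loss, the top-4 concentration measure, and the magnitude-threshold classifier are all simplifications introduced here, not the paper's actual LLM-scale implementation.

```python
# Illustrative sketch of Gradient Deviation Scoring (GDS): members of the
# training distribution yield small, concentrated gradients; non-members
# yield large, diffuse ones. All modeling choices below are assumptions.
import numpy as np

rng = np.random.default_rng(0)
dim = 32
w = rng.normal(size=dim)  # stand-in for a trained model's parameters


def per_sample_gradient(x, y):
    """Gradient of the squared loss 0.5*(w@x - y)**2 w.r.t. w."""
    return (w @ x - y) * x


def gradient_profile(g):
    """Features: update magnitude, dominant update location, concentration."""
    mag = np.linalg.norm(g)                 # magnitude of the update
    loc = int(np.argmax(np.abs(g)))         # position of the largest update
    a = np.abs(g) / (np.abs(g).sum() + 1e-12)
    conc = float(np.sort(a)[-4:].sum())     # mass of top-4 components
    return np.array([mag, loc, conc])


# Members: targets consistent with w (small residual -> small gradients).
# Non-members: random targets (large residual -> large gradients).
X = rng.normal(size=(200, dim))
y_member = X @ w + rng.normal(scale=0.1, size=200)
y_nonmember = rng.normal(scale=5.0, size=200)

inputs = np.vstack([X, X])
targets = np.concatenate([y_member, y_nonmember])
feats = np.array([gradient_profile(per_sample_gradient(x, t))
                  for x, t in zip(inputs, targets)])
labels = np.array([1] * 200 + [0] * 200)

# "Lightweight classifier": a simple threshold on gradient magnitude,
# standing in for the paper's learned binary classifier.
threshold = np.median(feats[:, 0])
pred = (feats[:, 0] < threshold).astype(int)  # familiar => small gradient
acc = float((pred == labels).mean())
print(f"membership accuracy: {acc:.2f}")
```

In practice the paper profiles gradients of FFN and Attention modules of an actual LLM and trains a classifier over those features; the threshold rule here only conveys why familiar (member) samples are separable by their gradient deviations.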
Problem

Research questions and friction points this paper is trying to address.

pre-training data detection
large language models
copyright concerns
benchmark contamination
membership inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient Deviation
Pre-training Data Detection
Membership Inference
Large Language Models
Optimization Dynamics
Ruiqi Zhang
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University; School of Artificial Intelligence, Beihang University
Lingxiang Wang
Beihang University
NLP
Hainan Zhang
Beihang University
Dialogue Generation · Text Generation · Federated Learning · Natural Language Processing
Zhiming Zheng
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University; School of Artificial Intelligence, Beihang University
Yanyan Lan
Tsinghua University
Information Retrieval · Machine Learning · AI4Science