Learning to Detect Language Model Training Data via Active Reconstruction

📅 2026-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing membership inference attacks (MIAs) against large language models operate passively on fixed model weights, which limits their effectiveness. This work proposes the Active Data Reconstruction Attack (ADRA), the first approach to combine reinforcement learning with active reconstruction for membership inference. ADRA fine-tunes the target model via policy gradient to induce it to reconstruct candidate texts, exploiting the principle that training data are more readily reconstructible than non-member data. The method introduces a learnable reconstruction metric and a contrastive reward mechanism to guide the attack. Evaluated on pretraining, post-training, and distilled data detection, ADRA outperforms the previous best method by 10.7% on average, with gains of 18.8% on BookMIA and 7.6% on AIME.
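The contrastive reward described above can be illustrated with a minimal sketch. The paper's learnable reconstruction metric is not public, so this toy version scores a policy rollout with a simple string-similarity metric and rewards it only for reconstructing the candidate more faithfully than a batch of reference texts; all function names here are hypothetical.

```python
from difflib import SequenceMatcher

def reconstruction_score(generated: str, target: str) -> float:
    """Toy similarity between a model rollout and the candidate text
    (stand-in for the paper's learnable reconstruction metric)."""
    return SequenceMatcher(None, generated, target).ratio()

def contrastive_reward(rollout: str, candidate: str, references: list[str]) -> float:
    """Hypothetical contrastive reward: reconstruction of the candidate
    minus the mean reconstruction of reference texts, so the policy gains
    reward only for reproducing the candidate, not generic fluent text."""
    pos = reconstruction_score(rollout, candidate)
    neg = sum(reconstruction_score(rollout, r) for r in references) / len(references)
    return pos - neg
```

In an RL loop, this scalar would serve as the policy-gradient reward for each rollout sampled from the fine-tuned copy of the target model.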

📝 Abstract
Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induce a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and that this difference in reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in the weights, we leverage on-policy RL to actively elicit data reconstruction by fine-tuning a policy initialized from the target model. To use RL effectively for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
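The detection step implied by the reconstructibility hypothesis can be sketched as follows. Assuming we have measured a per-candidate reconstruction score before and after RL fine-tuning (the measurement itself is not shown), members should show a larger gain; the scoring rule and function names below are illustrative, not the paper's exact procedure.

```python
def membership_scores(pre: dict[str, float], post: dict[str, float]) -> dict[str, float]:
    """Score each candidate by its reconstructibility gain under RL
    fine-tuning; training data are hypothesized to gain the most."""
    return {c: post[c] - pre[c] for c in pre}

def rank_candidates(pre: dict[str, float], post: dict[str, float]) -> list[str]:
    """Rank candidates from most to least likely member."""
    scores = membership_scores(pre, post)
    return sorted(scores, key=scores.get, reverse=True)
```

Given a pool of candidates, thresholding or ranking these gain scores yields the membership decision.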
Problem

Research questions and friction points this paper is trying to address.

membership inference attack
training data detection
large language models
data reconstruction
privacy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active Data Reconstruction Attack
Membership Inference Attack
Reinforcement Learning
Data Reconstructibility
On-policy RL