🤖 AI Summary
Evaluating training-data leakage in large language models (LLMs) remains challenging because their internal memorization behavior is opaque. Method: This paper proposes a black-box prompt optimization framework in which an attacker LLM agent iteratively generates instruction-style prompts to elicit training data from a target model. It departs from conventional prefix-suffix prompting, which feeds the model the training data directly, by using LLM-generated instructions that have minimal overlap with the training data. Contribution/Results: The approach shows that instruction-tuned models can expose as much pretraining data as their base models, if not more, and that contexts other than the original training data can also trigger leakage. The optimized instruction-based prompts yield outputs with 23.7% higher overlap with training data than the prefix-suffix baseline. The results also suggest that instructions proposed by other LLMs open a new avenue of automated attacks that warrants further study.
📝 Abstract
In this paper, we introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim model than are revealed by prompting the victim directly with the training data, which is the dominant approach for quantifying memorization in LLMs. We use an iterative rejection-sampling optimization process to find instruction-based prompts with two main characteristics: (1) minimal overlap with the training data, to avoid presenting the solution directly to the model, and (2) maximal overlap between the victim model's output and the training data, aiming to induce the victim to spit out training data. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. Our findings show that (1) instruction-tuned models can expose pre-training data as much as their base models, if not more so, (2) contexts other than the original training data can lead to leakage, and (3) using instructions proposed by other LLMs can open a new avenue of automated attacks that we should further study and explore. The code can be found at https://github.com/Alymostafa/Instruction_based_attack .
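To make the iterative rejection-sampling process described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation (see their repository for that). It assumes a token-level `difflib` similarity as a stand-in for the paper's overlap metric, and the names `overlap`, `optimize_prompt`, `propose`, `victim`, and the threshold `max_prompt_overlap` are all hypothetical. The attacker and victim are passed in as plain callables; the toy versions at the bottom only exist so the loop can run without any model.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def overlap(a: str, b: str) -> float:
    """Token-level similarity in [0, 1]. Assumption: a stand-in for the paper's metric."""
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def optimize_prompt(
    target: str,                                # training sample we try to elicit
    seed_prompt: str,                           # initial instruction (assumed low-overlap)
    propose: Callable[[str, int], List[str]],   # attacker: prompt -> candidate prompts
    victim: Callable[[str], str],               # victim: prompt -> model output
    rounds: int = 3,
    n_candidates: int = 4,
    max_prompt_overlap: float = 0.2,            # hypothetical threshold for criterion (1)
) -> Tuple[str, float]:
    best_prompt = seed_prompt
    best_score = overlap(victim(seed_prompt), target)
    for _ in range(rounds):
        for cand in propose(best_prompt, n_candidates):
            # Criterion (1): reject prompts that hand the answer to the victim.
            if overlap(cand, target) > max_prompt_overlap:
                continue
            # Criterion (2): keep the prompt whose output best matches the target.
            score = overlap(victim(cand), target)
            if score > best_score:
                best_prompt, best_score = cand, score
    return best_prompt, best_score


# Toy stand-ins (purely illustrative, no real LLM involved).
target = "the quick brown fox jumps over the lazy dog"


def toy_propose(prompt: str, n: int) -> List[str]:
    # The last candidate leaks the target verbatim and should be rejected.
    return [f"{prompt} variant {i}" for i in range(n)] + [target]


def toy_victim(prompt: str) -> str:
    # Pretend one particular phrasing triggers the memorized text.
    return target if "variant 2" in prompt else "unrelated text"


best_prompt, best_score = optimize_prompt(
    target, "Write a sentence about animals", toy_propose, toy_victim
)
```

In this toy run the verbatim-target candidate is filtered out by the prompt-overlap check, and the loop settles on an instruction that shares no tokens with the target yet still makes the toy victim reproduce it. With real models, `propose` would call the attacker LLM with feedback about the current best prompt, and `victim` would query the target model.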