In-Context Probing for Membership Inference in Fine-Tuned Language Models

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-tuned language models are vulnerable to membership inference attacks (MIAs), yet existing black-box methods, which rely on confidence or likelihood scores, are confounded by intrinsic sample properties and suffer from poor generalizability and low signal-to-noise ratios. This paper proposes ICP-MIA, the first training-free, black-box MIA framework grounded in training-dynamics theory. It introduces the "optimization gap" as the core membership signal, estimated efficiently via in-context probing (ICP), which combines reference-data alignment and self-perturbation strategies. Crucially, it formalizes the diminishing-returns phenomenon at convergence as a discriminative criterion, integrating semantic retrieval, token-level masking/generation perturbations, and PEFT-sensitivity analysis. Evaluated across multiple tasks and large language models, ICP-MIA significantly outperforms state-of-the-art methods, achieving 12–28% higher AUC at FPR < 1%. The study further identifies critical factors governing attack efficacy: reference-data alignment quality, model scale, and PEFT variant type.
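The low-FPR regime highlighted in the summary is commonly measured as the true-positive rate achieved under a fixed false-positive budget. A minimal sketch of that metric (the function name, toy scores, and threshold rule are illustrative assumptions, not the paper's exact evaluation code):

```python
# Hypothetical sketch: evaluating an MIA under a strict false-positive budget.
# All names and numbers here are illustrative, not from the paper.

def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.01):
    """TPR achieved when the decision threshold is set so that at most
    `max_fpr` of non-members are (wrongly) flagged as members."""
    # Threshold at the (1 - max_fpr) quantile of non-member scores;
    # only scores strictly above it are flagged, keeping FPR <= max_fpr.
    k = int((1.0 - max_fpr) * len(nonmember_scores))
    threshold = sorted(nonmember_scores)[min(k, len(nonmember_scores) - 1)]
    flagged = sum(1 for s in member_scores if s > threshold)
    return flagged / len(member_scores)

# Toy example: members tend to receive higher membership scores.
members = [0.9, 0.8, 0.85, 0.4, 0.7]
nonmembers = [0.1, 0.2, 0.3, 0.25, 0.5]
print(tpr_at_fpr(members, nonmembers, max_fpr=0.2))  # → 0.8
```

Reporting TPR at a small fixed FPR (rather than overall AUC alone) reflects the attack's usefulness when false accusations of membership must be rare.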

📝 Abstract
Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs), especially when models are adapted to domain-specific tasks using sensitive data. While prior black-box MIA techniques rely on confidence scores or token likelihoods, these signals are often entangled with a sample's intrinsic properties, such as content difficulty or rarity, leading to poor generalization and low signal-to-noise ratios. In this paper, we propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics, particularly the phenomenon of diminishing returns during optimization. We introduce the Optimization Gap as a fundamental signal of membership: at convergence, member samples exhibit minimal remaining loss-reduction potential, while non-members retain significant potential for further optimization. To estimate this gap in a black-box setting, we propose In-Context Probing (ICP), a training-free method that simulates fine-tuning-like behavior via strategically constructed input contexts. We propose two probing strategies: reference-data-based (using semantically similar public samples) and self-perturbation (via masking or generation). Experiments on three tasks and multiple LLMs show that ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates. We further analyze how reference data alignment, model type, PEFT configurations, and training schedules affect attack effectiveness. Our findings establish ICP-MIA as a practical and theoretically grounded framework for auditing privacy risks in deployed LLMs.
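The abstract's core idea, the Optimization Gap estimated via in-context probing, can be sketched as follows. Everything here is an assumption for illustration: `lm_loss` stands in for a black-box query returning the target model's loss on a text, and `retrieve_reference` for semantic retrieval from a public corpus; the paper's exact scoring function is not reproduced here.

```python
# Illustrative sketch of the Optimization Gap signal. `lm_loss` and
# `retrieve_reference` are hypothetical stand-ins, not the paper's API.

def optimization_gap(sample, lm_loss, retrieve_reference):
    """Loss reduction obtained by prepending a semantically similar public
    sample as in-context 'pseudo fine-tuning'. Members are already near
    convergence, so the context helps them little; non-members improve more,
    so a large gap suggests the sample was NOT in the training set."""
    base_loss = lm_loss(sample)
    reference = retrieve_reference(sample)            # reference-data probing
    probed_loss = lm_loss(reference + "\n" + sample)  # loss with context
    return base_loss - probed_loss

# Toy stand-ins so the sketch runs: pretend more context lowers loss.
def fake_lm_loss(text):
    return 5.0 / (1.0 + len(text) / 100.0)

gap = optimization_gap("some candidate text", fake_lm_loss,
                       lambda s: "a similar public document")
```

In a real attack the gap would be computed from the target LLM's token likelihoods and thresholded (or calibrated against non-member gaps) to decide membership.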
Problem

Research questions and friction points this paper is trying to address.

Membership inference attacks threaten the privacy of fine-tuned language models.
Prior black-box methods suffer from poor generalization and low signal-to-noise ratios.
The paper responds with a training-dynamics-based framework to improve attack effectiveness.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses the optimization gap as the membership signal
Proposes a training-free in-context probing (ICP) method
Implements reference-data-based and self-perturbation probing strategies
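The self-perturbation strategy listed above needs no external reference data: perturbed copies of the sample itself (e.g., with tokens masked out) serve as the probing context. A hypothetical sketch under that assumption; the function names, masking scheme, and toy loss are illustrative, not the paper's implementation:

```python
# Illustrative sketch of self-perturbation probing (all names are assumptions).
import random

def mask_tokens(text, rate=0.3, rng=None, mask="<mask>"):
    """Replace a fraction of whitespace-delimited tokens with a mask token."""
    rng = rng or random.Random(0)
    return " ".join(mask if rng.random() < rate else tok
                    for tok in text.split())

def self_perturbation_score(sample, lm_loss, n_probes=4):
    """Average loss reduction when perturbed variants of the sample itself
    serve as the probing context. As with reference-data probing, a larger
    reduction suggests remaining optimization potential, i.e., a non-member."""
    rng = random.Random(0)  # fixed seed for reproducible probes
    base = lm_loss(sample)
    gaps = []
    for _ in range(n_probes):
        context = mask_tokens(sample, rng=rng)
        gaps.append(base - lm_loss(context + "\n" + sample))
    return sum(gaps) / n_probes

# Toy stand-in loss: pretend longer input means lower loss.
fake_loss = lambda text: 5.0 / (1.0 + len(text) / 100.0)
score = self_perturbation_score("the quick brown fox jumps over a lazy dog",
                                fake_loss)
```

Because the perturbed context is derived from the sample alone, this variant applies even when no aligned public corpus is available, at the cost of weaker context quality than well-aligned reference data.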